Gene-expression profiling with reduced numbers of transcript measurements

ABSTRACT

The present invention provides compositions and methods for making and using a transcriptome-wide gene-expression profiling platform that measures the expression levels of only a select subset of the total number of transcripts. Because gene expression is believed to be highly correlated, direct measurement of a small number (for example, 1,000) of appropriately-selected transcripts allows the expression levels of the remainder to be inferred. The present invention, therefore, has the potential to reduce the cost and increase the throughput of full-transcriptome gene-expression profiling relative to the well-known conventional approaches that require all transcripts to be measured.

RELATED APPLICATIONS AND INCORPORATION BY REFERENCE

This application is a continuation of U.S. patent application Ser. No. 13/646,294, filed on Oct. 5, 2012, which is now U.S. Pat. No. 10,619,195, issued on Apr. 14, 2020; which is a continuation-in-part of PCT/US2011/031395, filed Apr. 6, 2011, which published as PCT Publication No. WO 2011/127150 on 13 Oct. 2011, which claims benefit of U.S. Provisional Application Serial No. 61/321,298, filed 6 Apr. 2010.

This application is also a continuation of U.S. patent application Ser. No. 13/646,294, filed on Oct. 5, 2012, which is now U.S. Pat. No. 10,619,195, issued on Apr. 14, 2020; which is a continuation-in-part of International Patent Application Serial No. PCT/US2011/031232, filed 5 Apr. 2011, which published as PCT Publication No. WO 2011/127042 on 13 Oct. 2011, which claims benefit of U.S. Provisional Patent Application Ser. No. 61/321,385, filed 6 Apr. 2010, the entire contents of which are incorporated herein by reference.

The foregoing applications, and all documents cited therein or during their prosecution (“appln cited documents”) and all documents cited or referenced in the appln cited documents, and all documents cited or referenced herein (“herein cited documents”), and all documents cited or referenced in herein cited documents, together with any manufacturer's instructions, descriptions, product specifications, and product sheets for any products mentioned herein or in any document incorporated by reference herein, are hereby incorporated herein by reference, and may be employed in the practice of the invention. More specifically, all referenced documents are incorporated by reference to the same extent as if each individual document was specifically and individually indicated to be incorporated by reference.

FEDERAL FUNDING LEGEND

This invention was made with government support under Grant Nos. CA133834 and U54 6916636 awarded by the National Institutes of Health. The government has certain rights in the invention.

FIELD OF THE INVENTION

The present invention relates to genomic informatics and gene-expression profiling. Gene-expression profiles provide complex molecular fingerprints regarding the relative state of a cell or tissue. Similarities in gene-expression profiles between organic states (i.e., for example, normal and diseased cells and/or tissues) provide molecular taxonomies, classification, and diagnostics. Similarities in gene-expression profiles resulting from various external perturbations (i.e., for example, ablation or enforced expression of specific genes, and/or small molecules, and/or environmental changes) reveal functional similarities between these perturbagens, of value in pathway and mechanism-of-action elucidation. Similarities in gene-expression profiles between organic (e.g. disease) and induced (e.g. by small molecule) states may identify clinically-effective therapies. Improvements described herein allow for the efficient and economical generation of full-transcriptome gene-expression profiles by identifying cluster centroid landmark transcripts that predict the expression levels of other transcripts within the same cluster.

BACKGROUND OF THE INVENTION

High-density, whole-transcriptome DNA microarrays are the method of choice for unbiased gene-expression profiling. These profiles have been found useful for the classification and diagnosis of disease, predicting patient response to therapy, exploring biological mechanisms, in classifying and elucidating the mechanisms-of-action of small molecules, and in identifying new therapeutics. van de Vijver et al., “A gene expression signature as a predictor of survival in breast cancer” N Engl J Med 347:1999-2009 (2002); Lamb et al., “A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer” Cell 114:323-334 (2003); Glas et al., “Gene expression profiling in follicular lymphoma to assess clinical aggressiveness and to guide the choice of treatment” Blood 105:301-307 (2005); Burczynski et al., “Molecular classification of Crohn's disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells” J Mol Diagn 8:51-61 (2006); Golub et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring” Science 286:531 (1999); Ramaswamy et al., “Multiclass cancer diagnosis using tumor gene expression signatures” Proc Natl Acad Sci 98:15149 (2001); Lamb et al., “The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease” Science 313:1929 (2006). However, the overall success and widespread use of these methods is severely limited by the high cost and low throughput of existing transcriptome-analysis technologies. For example, using gene-expression profiling to screen for small molecules with desirable biological effects is practical only if one could analyze thousands of compounds per day at a cost dramatically below that of conventional microarrays.

What is needed in the art is a simple, flexible, cost-effective, and high-throughput transcriptome-wide gene-expression profiling solution that would allow for the analysis of many thousands of tissue specimens and cellular states induced by external perturbations. This would greatly accelerate the rate of discovery of medically-relevant connections encoded therein. Methods have been developed to rapidly assay the expression of small numbers of transcripts in large numbers of samples; for example, Peck et al., “A method for high-throughput gene expression signature analysis” Genome Biol 7:R61 (2006). If transcripts that faithfully predict the expression levels of other transcripts could be identified, it is conceivable that the measurement of a set of such ‘landmark’ transcripts using such moderate-multiplex assay methods could, in concert with an algorithm that calculates the levels of the non-landmark transcripts from those measurements, provide the full-transcriptome gene-expression analysis solution sought.

Citation or identification of any document in this application is not an admission that such document is available as prior art to the present invention.

SUMMARY OF THE INVENTION

The present invention is related to the field of genomic informatics and gene-expression profiling. Gene-expression profiles provide complex molecular fingerprints regarding the relative state of a cell or tissue. Similarities in gene-expression profiles between organic states (i.e., for example, normal and diseased cells and/or tissues) provide molecular taxonomies, classification, and diagnostics. Similarities in gene-expression profiles resulting from various external perturbations (i.e., for example, ablation or enforced expression of specific genes, and/or small molecules, and/or environmental changes) reveal functional similarities between these perturbagens, of value in pathway and mechanism-of-action elucidation. Similarities in gene-expression profiles between organic (e.g. disease) and induced (e.g. by small molecule) states may identify clinically-effective therapies. Improvements described herein allow for the efficient and economical generation of full-transcriptome gene-expression profiles by identifying cluster centroid landmark transcripts that predict the expression levels of other transcripts within the same cluster.

In one embodiment, the present invention contemplates a method for making a transcriptome-wide mRNA-expression profiling platform using sub-transcriptome numbers of transcript measurements which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples; ii) a second collection of biological samples; iii) a second library of transcriptome-wide mRNA-expression data from said second collection of biological samples; iv) a device capable of measuring transcript expression levels; b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is substantially less than the total number of all transcripts; c) identifying a centroid transcript within each of said plurality of transcript clusters, thereby creating a plurality of centroid transcripts, said remaining transcripts being non-centroid transcripts; d) measuring the expression levels of at least a portion of transcripts from said second collection of biological samples with said device, wherein said portion of transcripts comprise transcripts identified as said centroid transcripts from said first library; e) determining the ability of said measurements of the expression levels of said centroid transcripts to infer the levels of at least a portion of transcripts from said second library, wherein said portion is comprised of non-centroid transcripts; f) selecting said centroid transcripts whose said expression levels have said ability to infer the levels of said portion of non-centroid transcripts. In one embodiment, the plurality of centroid transcripts is approximately 1000 centroid transcripts. In one embodiment, the device is selected from the group which may comprise a microarray, a bead array, a liquid array, or a nucleic-acid sequencer. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the method further may comprise repeating steps c) to f) until validated centroid transcripts for each of said plurality of transcript clusters are identified. In one embodiment, the plurality of clusters of transcripts are orthogonal. In one embodiment, the plurality of clusters of transcripts are non-overlapping. In one embodiment, the determining involves a correlation between said expression levels of said centroid transcripts and said expression levels of said non-centroid transcripts. In one embodiment, the expression levels of a set of substantially invariant transcripts are additionally measured with said device in said second collection of biological samples. In one embodiment, the measurements of said centroid transcripts made with said device, and said mRNA-expression data from said first and second libraries, are normalized with respect to the expression levels of a set of substantially invariant transcripts.
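
By way of non-limiting illustration only, steps b) and c) above (creating transcript clusters and identifying a centroid transcript within each) may be sketched as follows. The sketch assumes a greedy, correlation-threshold clustering and a "most correlated member" rule for the centroid; neither choice is specified by the text above, and the transcript names, expression values, threshold, and function names are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression levels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)

def greedy_clusters(profiles, threshold=0.9):
    """Assumed clustering rule: join the first cluster whose first member
    correlates at or above `threshold`, otherwise start a new cluster."""
    clusters = []
    for name in profiles:
        for members in clusters:
            if pearson(profiles[name], profiles[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def centroid_transcript(members, profiles):
    """Assumed centroid rule: the member with the highest mean correlation
    to the other members of its cluster."""
    if len(members) == 1:
        return members[0]
    def mean_corr(name):
        others = [m for m in members if m != name]
        return sum(pearson(profiles[name], profiles[o]) for o in others) / len(others)
    return max(members, key=mean_corr)

# Hypothetical first library: transcript -> expression level per sample.
profiles = {
    "T1": [1.0, 2.0, 3.0, 4.0],
    "T2": [2.0, 4.1, 6.0, 8.2],
    "T3": [1.1, 2.0, 2.9, 4.1],
    "T4": [4.0, 1.0, 3.5, 0.5],
    "T5": [8.1, 2.0, 7.0, 1.2],
}
clusters = greedy_clusters(profiles)
centroids = [centroid_transcript(c, profiles) for c in clusters]
print(clusters, centroids)
```

In this toy data the five transcripts fall into two clusters, so two centroid transcripts would be measured in place of five; any cluster analysis that yields substantially fewer clusters than transcripts could be substituted for the greedy rule assumed here.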

In one embodiment, the present invention contemplates a method for identifying a subpopulation of predictive transcripts within a transcriptome, which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples; ii) a second collection of biological samples or a second library of transcriptome-wide mRNA-expression data from said second collection of biological samples; iii) a device capable of measuring transcript expression levels; b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters thereby creating a plurality of centroid transcripts, said remaining transcripts being non-centroid transcripts; d) processing transcripts from said second collection of biological samples on said device so as to measure expression levels of said centroid transcripts, and e) determining which of said plurality of centroid transcripts measured on said device predict the levels of said non-centroid transcripts in said second library of transcriptome-wide data. In one embodiment, the plurality of centroid transcripts is approximately 1000 centroid transcripts. In one embodiment, the device is selected from the group which may comprise a microarray, a bead array, a liquid array, or a nucleic-acid sequencer. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the determining involves a correlation between said centroid transcript and said non-centroid transcript. In one embodiment, the method further may comprise repeating steps c) to e).
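
As a hedged illustration of step e) above, the following sketch retains a candidate centroid transcript only if its device measurements correlate, on average, with the second-library levels of the non-centroid transcripts in its cluster. The mean-correlation criterion, the cutoff of 0.8, and all names and values are assumptions made for the example and are not taken from the specification.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def validate_centroids(clusters, device_levels, library_levels, cutoff=0.8):
    """Return the centroids whose measured levels track their cluster members.

    clusters       : centroid name -> list of non-centroid member names
    device_levels  : centroid name -> levels measured on the device, per sample
    library_levels : non-centroid name -> second-library levels, same samples
    cutoff         : assumed minimum mean correlation for a centroid to be kept
    """
    validated = []
    for centroid, members in clusters.items():
        scores = [pearson(device_levels[centroid], library_levels[m]) for m in members]
        if scores and sum(scores) / len(scores) >= cutoff:
            validated.append(centroid)
    return validated

# Hypothetical data for two candidate centroids across four samples.
clusters = {"C1": ["N1", "N2"], "C2": ["N3"]}
device_levels = {"C1": [1.0, 2.0, 3.0, 4.0], "C2": [5.0, 1.0, 4.0, 2.0]}
library_levels = {
    "N1": [2.0, 4.0, 6.0, 8.0],
    "N2": [1.0, 2.5, 2.9, 4.2],
    "N3": [1.0, 2.0, 3.0, 4.0],
}
validated = validate_centroids(clusters, device_levels, library_levels)
print(validated)
```

Here the candidate labeled C2 is rejected because its measurements do not track its cluster member; under the repetition of steps c) to e) contemplated above, a different candidate from that cluster could then be tried.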

In one embodiment, the present invention contemplates a method for identifying a subpopulation of approximately 1000 predictive transcripts within a transcriptome, which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples representing greater than 1000 different transcripts, and ii) transcripts from a second collection of biological samples; b) performing computational analysis on said first library such that a plurality of clusters of transcripts are created, wherein the number of said clusters is approximately 1000 and less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters, said remaining transcripts being non-centroid transcripts; d) processing the transcripts from said second collection of biological samples so as to measure the expression levels of non-centroid transcripts, so as to create first measurements, and expression levels of centroid transcripts, so as to create second measurements; and e) determining which centroid transcripts based on said second measurements predict the levels of said non-centroid transcripts, based on said first measurements, thereby identifying a subpopulation of predictive transcripts within a transcriptome. In one embodiment, the method further may comprise a device capable of measuring the expression levels of said centroid transcripts. In one embodiment, the device is capable of measuring the expression levels of approximately 1000 of said centroid transcripts. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the determining involves a correlation between said centroid transcript and said non-centroid transcript. In one embodiment, the method further may comprise repeating steps c) to e).

In one embodiment, the present invention contemplates a method for predicting the expression level of a first population of transcripts by measuring the expression level of a second population of transcripts, which may comprise: a) providing: i) a first heterogeneous population of transcripts which may comprise a second heterogeneous population of transcripts, said second population which may comprise a subset of said first population, ii) an algorithm capable of predicting the level of expression of transcripts within said first population which are not within said second population, said predicting based on the measured level of expression of transcripts within said second population; b) processing said first heterogeneous population of transcripts under conditions such that a plurality of different templates representing only said second population of transcripts is created; c) measuring the amount of each of said different templates to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of transcripts within said first population which are not within said second population. In one embodiment, the first heterogeneous population of transcripts comprise a plurality of non-centroid transcripts. In one embodiment, the second heterogeneous population of transcripts may comprise a plurality of centroid transcripts. In one embodiment, the method further may comprise a device capable of measuring the amount of approximately 1000 of said different templates. In one embodiment, the device is selected from the group which may comprise a microarray, a bead array, a liquid array, or a nucleic-acid sequencer. In one embodiment, the algorithm involves a dependency matrix.
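
For illustration only, step d) above may be sketched under the assumption that the dependency matrix is a table of linear weights relating each unmeasured transcript to the measured ones. The passage above states only that the algorithm involves a dependency matrix; the linear form, the weights, the transcript names, and the function name below are hypothetical.

```python
def infer_levels(dependency, measured):
    """Apply an assumed linear dependency matrix to measured transcript levels.

    dependency : unmeasured transcript -> {measured transcript: weight}
    measured   : measured transcript -> measured expression level
    Returns a dict of inferred levels for the unmeasured transcripts.
    """
    inferred = {}
    for target, weights in dependency.items():
        inferred[target] = sum(w * measured[name] for name, w in weights.items())
    return inferred

# Hypothetical weights, e.g. as might be fitted beforehand on a reference library.
dependency = {
    "GENE_A": {"LM1": 0.5, "LM2": 0.25},
    "GENE_B": {"LM1": 0.0, "LM2": 2.0},
}
measured = {"LM1": 8.0, "LM2": 4.0}

# The full profile combines the measured and the inferred levels.
profile = {**measured, **infer_levels(dependency, measured)}
print(profile)
```

Each row of the assumed matrix holds one unmeasured transcript's weights, so applying the algorithm reduces to one weighted sum per unmeasured transcript.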

In one embodiment, the present invention contemplates a method of assaying gene expression, which may comprise: a) providing: i) approximately 1000 different barcode sequences; ii) approximately 1000 beads, each bead which may comprise a homogeneous set of nucleic-acid probes, each set complementary to a different barcode sequence of said approximately 1000 barcode sequences; iii) a population of more than 1000 different transcripts, each transcript which may comprise a gene-specific sequence; iv) an algorithm capable of predicting the level of expression of unmeasured transcripts; b) processing said population of transcripts to create approximately 1000 different templates, each template which may comprise one of said approximately 1000 barcode sequences operably associated with a different gene-specific sequence, wherein said approximately 1000 different templates represents less than the total number of transcripts within said population; c) measuring the amount of each of said approximately 1000 different templates to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of unmeasured transcripts within said population. In one embodiment, the method further may comprise a device capable of measuring the amount of each of said approximately 1000 different templates. In one embodiment, the beads are optically addressed. In one embodiment, the processing may comprise ligation-mediated amplification. In one embodiment, the measuring may comprise detecting said optically addressed beads. In one embodiment, the measuring may comprise hybridizing said approximately 1000 different templates to said approximately 1000 beads through said nucleic-acid probes complementary to said approximately 1000 barcode sequences. In one embodiment, the measuring may comprise a flow cytometer. In one embodiment, the algorithm involves a dependency matrix.
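
The relationship between barcode sequences, bead-bound probes, and gene-specific templates described above may be illustrated by the following minimal sketch. It assumes that each bead's probe is the reverse complement of one barcode and that each bead yields a single signal value; the sequences, gene names, bead identifiers, and signal values are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical barcode assigned to each measured gene-specific sequence.
barcodes = {"GENE_X": "AACGTCAG", "GENE_Y": "TTGACCGA"}

# One bead per barcode; its probe is assumed to be the barcode's reverse complement.
bead_probes = {}
bead_to_gene = {}
for bead_id, (gene, barcode) in enumerate(barcodes.items(), start=1):
    bead_probes[bead_id] = reverse_complement(barcode)
    bead_to_gene[bead_id] = gene

# Hypothetical per-bead signals, e.g. as read out by a flow cytometer.
signals = {1: 1500.0, 2: 320.0}

# Decode bead signals back into per-gene measurements.
levels = {bead_to_gene[bead]: value for bead, value in signals.items()}
print(bead_probes, levels)
```

The resulting per-gene measurements are the plurality of measurements to which the predicting algorithm of step d) would then be applied.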

In one embodiment, the present invention contemplates a composition which may comprise an amplified nucleic acid sequence, wherein said sequence may comprise at least a portion of a cluster centroid transcript sequence and a barcode sequence, wherein said composition further may comprise an optically addressed bead, and wherein said bead may comprise a capture probe nucleic-acid sequence hybridized to said barcode. In one embodiment, the barcode sequence is at least partially complementary to said capture probe nucleic acid. In one embodiment, the amplified nucleic-acid sequence is biotinylated. In one embodiment, the optically addressed bead is detectable with a flow cytometric system. In one embodiment, the flow cytometric system discriminates between approximately 500-1000 optically addressed beads.

In one embodiment, the present invention contemplates a method for creating a genome-wide expression profile, which may comprise: a) providing: i) a plurality of genomic transcripts derived from a biological sample; ii) a plurality of centroid transcripts which may comprise at least a portion of said genomic transcripts, said remaining genomic transcripts being non-centroid transcripts; b) measuring the expression level of said plurality of centroid transcripts; c) inferring the expression levels of said non-centroid transcripts from said centroid transcript expression levels, thereby creating a genome-wide expression profile. In one embodiment, the plurality of centroid transcripts comprise approximately 1,000 transcripts. In one embodiment, the measuring may comprise a device selected from the group which may comprise a microarray, a bead array, a liquid array, or a nucleic-acid sequencer. In one embodiment, the inferring involves a dependency matrix. In one embodiment, the genome-wide expression profile identifies said biological sample as diseased. In one embodiment, the genome-wide expression profile identifies said biological sample as healthy. In one embodiment, the genome-wide expression profile provides a functional readout of the action of a perturbagen. In one embodiment, the genome-wide expression profile may comprise an expression profile suitable for use in a connectivity map. In one embodiment, the expression profile is compared with query signatures for similarities. In one embodiment, the genome-wide expression profile may comprise a query signature compatible with a connectivity map. In one embodiment, the query signature is compared with known genome-wide expression profiles for similarities.

In one embodiment, the present invention contemplates a kit, which may comprise: a) a first container which may comprise a plurality of centroid transcripts derived from a transcriptome; b) a second container which may comprise buffers and reagents compatible with measuring the expression level of said plurality of centroid transcripts within a biological sample; c) a set of instructions for inferring the expression level of non-centroid transcripts within said biological sample, based upon the expression level of said plurality of centroid transcripts. In one embodiment, the plurality of centroid transcripts is approximately 1,000 transcripts.

In one embodiment, the present invention contemplates a method for making a transcriptome-wide mRNA-expression profile, which may comprise: a) providing: i) a composition of validated centroid transcripts numbering substantially less than the total number of all transcripts; ii) a device capable of measuring the expression levels of said validated centroid transcripts; iii) an algorithm capable of substantially calculating the expression levels of transcripts not amongst the set of said validated centroid transcripts from expression levels of said validated centroid transcripts measured by said device and transcript cluster information created from a library of transcriptome-wide mRNA-expression data from a collection of biological samples; and iv) a biological sample; b) applying said biological sample to said device whereby expression levels of said validated centroid transcripts in said biological sample are measured; and c) applying said algorithm to said measurements thereby creating a transcriptome-wide mRNA expression profile. In one embodiment, the validated centroid transcripts comprise approximately 1,000 transcripts. In one embodiment, the device is selected from the group which may comprise a microarray, a bead array, a liquid array, or a nucleic-acid sequencer. In one embodiment, the expression levels of a set of substantially invariant transcripts are additionally measured in said biological sample. In one embodiment, the expression levels of said validated centroid transcripts are normalized with respect to said expression levels of said invariant transcripts.
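
One simple way in which measured levels might be normalized with respect to invariant transcripts is sketched below. The sketch assumes a single scaling factor per sample, chosen so that the mean of the invariant transcripts equals a fixed target; the specification does not prescribe this particular scheme, and the target value, transcript names, and levels are hypothetical.

```python
def normalize_to_invariants(sample, invariant_names, target=100.0):
    """Rescale one sample so its invariant transcripts average `target`.

    sample          : transcript name -> raw measured level
    invariant_names : names of the substantially invariant transcripts
    target          : assumed reference level for the invariant-set mean
    """
    observed = sum(sample[name] for name in invariant_names) / len(invariant_names)
    if observed == 0:
        raise ValueError("invariant transcripts have zero mean signal")
    scale = target / observed
    return {name: level * scale for name, level in sample.items()}

# Hypothetical raw measurements: two invariant and two centroid transcripts.
sample = {"INV1": 40.0, "INV2": 60.0, "LM1": 10.0, "LM2": 25.0}
normalized = normalize_to_invariants(sample, ["INV1", "INV2"])
print(normalized)
```

Because the same factor is applied to every transcript in a sample, ratios between transcripts within the sample are preserved while samples measured at different overall signal intensities are brought onto a common scale.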

In one embodiment, the present invention contemplates a method for making a transcriptome-wide mRNA-expression profiling platform which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples; ii) a second library of transcriptome-wide mRNA-expression data from a second collection of biological samples; iii) a device capable of measuring transcript expression levels; b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is substantially less than the total number of all transcripts; c) identifying a centroid transcript within each of said plurality of transcript clusters, thereby creating a plurality of centroid transcripts; d) identifying a set of substantially invariant transcripts from said first library; e) measuring the expression levels of at least a portion of transcripts from said second collection of biological samples with said device, wherein said portion of transcripts comprise transcripts identified as said centroid transcripts and said invariant transcripts from said first library; f) determining the ability of said measurements of expression levels of said plurality of centroid transcripts to infer the levels of at least a portion of non-centroid transcripts from said second library. In one embodiment, the plurality of centroid transcripts is approximately 1000 centroid transcripts. In one embodiment, the device may comprise a genome-wide microarray. In one embodiment, the method further may comprise repeating steps c) to f) until validated centroid transcripts for each of said plurality of transcript clusters are identified. In one embodiment, the plurality of clusters of transcripts are orthogonal. In one embodiment, the plurality of clusters of transcripts are non-overlapping.

In one embodiment, the present invention contemplates a method for predicting transcript levels within a transcriptome, which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples; ii) a second library of transcriptome-wide mRNA-expression data from a second collection of biological samples; iii) a device capable of measuring transcript expression levels; b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters thereby creating a plurality of centroid transcripts, said remaining transcripts being non-centroid transcripts; d) processing said second library transcripts on said device so as to measure expression levels of said centroid transcripts, and e) determining which of said plurality of centroid transcripts measured on said device predict the levels of said non-centroid transcripts in said second library of transcriptome-wide data. In one embodiment, the plurality of centroid transcripts is approximately 1000 centroid transcripts. In one embodiment, the device is selected from the group which may comprise a microarray, a bead array, or a liquid array. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the identifying may comprise repeating steps c) to e). In one embodiment, the processing utilizes a flow cytometer. In one embodiment, the determining identifies a correlation between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method for making a transcriptome-wide mRNA-expression profiling platform which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples; ii) a second collection of biological samples; iii) a second library of transcriptome-wide mRNA-expression data from said second collection of biological samples; iv) a device capable of measuring transcript expression levels; b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is substantially less than the total number of all transcripts; c) identifying a centroid transcript within each of said plurality of transcript clusters, thereby creating a plurality of centroid transcripts; d) measuring the expression levels of at least a portion of transcripts from said second collection of biological samples with said device, wherein said portion of transcripts comprise transcripts identified as said centroid transcripts from said first library; e) determining the ability of said measurements of the expression levels of said centroid transcripts to infer the levels of at least a portion of transcripts from said second library, wherein said portion is comprised of non-centroid transcripts. In one embodiment, the plurality of centroid transcripts is approximately 1000 centroid transcripts. In one embodiment, the device may comprise a microarray. In one embodiment, the device may comprise a bead array. In one embodiment, the device may comprise a liquid array. In one embodiment, the method further may comprise repeating steps c) to e) until validated centroid transcripts for each of said plurality of transcript clusters are identified. In one embodiment, the plurality of clusters of transcripts are orthogonal. In one embodiment, the plurality of clusters of transcripts are non-overlapping. In one embodiment, the determining involves a correlation between said centroid transcripts and said non-centroid transcripts. In one embodiment, the expression levels of a set of substantially invariant transcripts are additionally measured with said device in said second collection of biological samples. In one embodiment, the measurements of said centroid transcripts made with said device, and said mRNA-expression data from said first and second libraries, are normalized with respect to the expression levels of a set of substantially invariant transcripts.

In one embodiment, the present invention contemplates a method for identifying a subpopulation of approximately 1000 predictive transcripts within a transcriptome, which may comprise: a) providing: i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples representing greater than 1000 different transcripts, and ii) transcripts from a second collection of biological samples; b) performing computational analysis on said first library such that a plurality of clusters of transcripts are created, wherein the number of said clusters is approximately 1000 and less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters, said remaining transcripts being non-centroid transcripts; d) processing the transcripts from said second collection of biological samples so as to measure the expression levels of non-centroid transcripts, so as to create first measurements, and expression levels of centroid transcripts, so as to create second measurements; and e) determining which centroid transcripts based on said second measurements predict the levels of said non-centroid transcripts, based on said first measurements, thereby identifying a subpopulation of predictive transcripts within a transcriptome. In one embodiment, the method further may comprise a device capable of attaching said centroid transcripts. In one embodiment, the device attaches approximately 1000 of said centroid transcripts. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the identifying may comprise repeating steps c) to e). In one embodiment, the processing utilizes a flow cytometer. In one embodiment, the determining identifies a correlation between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method for predicting the expression level of a first population of transcripts by measuring the expression level of a second population of transcripts, which may comprise: a) providing: i) a first heterogeneous population of transcripts which may comprise a second heterogeneous population of transcripts, said second population which may comprise a subset of said first population, ii) an algorithm capable of predicting the level of expression of transcripts within said first population which are not within said second population, said predicting based on the measured level of expression of transcripts within said second population; b) processing said first heterogeneous population of transcripts under conditions such that a plurality of different templates representing only said second population of transcripts is created; c) measuring the amount of each of said different templates to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of transcripts within said first population which are not within said second population. In one embodiment, the first heterogeneous population of transcripts comprise a plurality of non-centroid transcripts. In one embodiment, the second heterogeneous population of transcripts may comprise a plurality of centroid transcripts. In one embodiment, the method further may comprise a device capable of attaching approximately 1000 of said centroid transcripts. In one embodiment, the measuring may comprise a flow cytometer. In one embodiment, the applying said algorithm identifies a correlation between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method of assaying gene expression, which may comprise: a) providing i) approximately 1000 different barcode sequences; ii) approximately 1000 beads, each bead which may comprise a homogeneous set of nucleic acid probes, each set complementary to a different barcode sequence of said approximately 1000 barcode sequences; iii) a population of more than 1000 different transcripts, each transcript which may comprise a gene specific sequence; iv) an algorithm capable of predicting the level of expression of unmeasured transcripts; b) processing said population of transcripts to create approximately 1000 different templates, each template which may comprise one of said approximately 1000 barcode sequences operably associated with a different gene specific sequence, wherein said approximately 1000 different templates represent less than the total number of transcripts within said population; c) measuring the amount of each of said approximately 1000 different templates to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of unmeasured transcripts within said population. In one embodiment, the method further may comprise a device capable of attaching approximately 1000 of said centroid transcripts. In one embodiment, the processing may comprise ligation mediated amplification. In one embodiment, the beads are optically addressable. In one embodiment, the measuring may comprise detecting said optically addressable beads. In one embodiment, the applying said algorithm may comprise identifying a correlation between said measured transcripts and said unmeasured transcripts.

In one embodiment, the present invention contemplates a composition which may comprise an amplified nucleic acid sequence, wherein said sequence may comprise at least a portion of a cluster centroid landmark transcript sequence and a barcode sequence, wherein said composition further may comprise an optically addressable bead, and wherein said bead may comprise a capture probe nucleic acid sequence hybridized to said barcode. In one embodiment, the barcode sequence is at least partially complementary to said capture probe nucleic acid. In one embodiment, the optically addressable bead is color coded. In one embodiment, the amplified nucleic acid sequence is biotinylated. In one embodiment, the optically addressable bead is detectable with a flow cytometric system. In one embodiment, the flow cytometric system simultaneously differentiates between approximately 500-1000 optically addressable beads.

In one embodiment, the present invention contemplates a method for creating a genome-wide expression profile, which may comprise: a) providing: i) a plurality of genomic transcripts derived from a biological sample; and ii) a plurality of centroid transcripts which may comprise at least a portion of said genomic transcripts, said remaining genomic transcripts being non-centroid transcripts; b) measuring the expression of said plurality of centroid transcripts; and c) inferring the expression levels of said non-centroid transcripts from said centroid transcript expression, thereby creating a genome-wide expression profile. In one embodiment, the plurality of centroid transcripts comprises approximately 1,000 transcripts. In one embodiment, the genome-wide expression profile identifies said biological sample as diseased. In one embodiment, the genome-wide expression profile identifies said biological sample as healthy. In one embodiment, the genome-wide expression profile may comprise a query signature compatible with a connectivity map. In one embodiment, the query signature is compared with known genome-wide expression profiles for similarities.

In one embodiment, the present invention contemplates a method for identifying a subpopulation of predictive transcripts within a transcriptome, which may comprise: a) providing i) a device to measure the expression level of transcripts, ii) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples, and iii) transcripts from a second collection of biological samples; b) performing computational analysis on said first library such that a plurality of clusters of transcripts are created, wherein the number of said clusters is less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters, said remaining transcripts being non-centroid transcripts; d) processing the transcripts from said second collection of biological samples so as to measure, with said device, the expression levels of non-centroid transcripts, so as to create first measurements, and expression levels of centroid transcripts, so as to create second measurements; and e) determining which centroid transcripts based on said second measurements predict the levels of said non-centroid transcripts, based on said first measurements, thereby identifying a subpopulation of predictive transcripts within a transcriptome. In one embodiment, the device may comprise a microarray. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the identifying may comprise an iterative validation algorithm. In one embodiment, the processing utilizes a cluster dependency matrix. In one embodiment, the determining identifies a dependency matrix between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method for identifying a subpopulation of approximately 1000 predictive transcripts within a transcriptome, which may comprise: a) providing i) a device to measure the expression level of transcripts, ii) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples representing greater than 1000 different transcripts, and iii) transcripts from a second collection of biological samples; b) performing computational analysis on said first library such that a plurality of clusters of transcripts are created, wherein the number of said clusters is approximately 1000 and less than the total number of all transcripts in said first library; c) identifying a centroid transcript within each of said transcript clusters, said remaining transcripts being non-centroid transcripts; d) processing the transcripts from said second collection of biological samples so as to measure, with said device, the expression levels of non-centroid transcripts, so as to create first measurements, and expression levels of centroid transcripts, so as to create second measurements; and e) determining which centroid transcripts based on said second measurements predict the levels of said non-centroid transcripts, based on said first measurements, thereby identifying a subpopulation of predictive transcripts within a transcriptome. In one embodiment, the device may comprise a microarray. In one embodiment, the computational analysis may comprise cluster analysis. In one embodiment, the identifying may comprise an iterative validation algorithm. In one embodiment, the processing utilizes a cluster dependency matrix. In one embodiment, the determining identifies a dependency matrix between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method for predicting the expression level of a first population of transcripts by measuring the expression level of a second population of transcripts, which may comprise: a) providing i) a first heterogeneous population of transcripts which may comprise a second heterogeneous population of transcripts, said second population which may comprise a subset of said first population, ii) a device, iii) an algorithm capable of predicting the level of expression of transcripts within said first population which are not within said second population, said predicting based on the measured level of expression of transcripts within said second population; b) processing said first heterogeneous population of transcripts under conditions such that a plurality of different templates representing only said second population of transcripts is created; c) measuring the amount of each of said different templates with said device to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of transcripts within said first population which are not within said second population. In one embodiment, the first heterogeneous population of transcripts comprises a plurality of non-centroid transcripts. In one embodiment, the second heterogeneous population of transcripts may comprise a plurality of centroid transcripts. In one embodiment, the device may comprise a microarray. In one embodiment, the processing may comprise computations selected from the group consisting of dimensionality reduction and cluster analysis. In one embodiment, the applying said algorithm identifies a dependency matrix between said centroid transcript and said non-centroid transcript.

In one embodiment, the present invention contemplates a method of assaying gene expression, which may comprise: a) providing i) approximately 1000 different barcode sequences; ii) approximately 1000 beads, each bead which may comprise a homogeneous set of nucleic acid probes, each set complementary to a different barcode sequence of said approximately 1000 barcode sequences; iii) a population of more than 1000 different transcripts, each transcript which may comprise a gene specific sequence; iv) a device; and v) an algorithm capable of predicting the level of expression of unmeasured transcripts; b) processing said population of transcripts to create approximately 1000 different templates, each template which may comprise one of said approximately 1000 barcode sequences operably associated with a different gene specific sequence, wherein said approximately 1000 different templates represent less than the total number of transcripts within said population; c) measuring the amount of each of said approximately 1000 different templates with said device to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of unmeasured transcripts within said population. In one embodiment, the device may comprise a microarray. In one embodiment, the processing may comprise ligation mediated amplification. In one embodiment, the beads are optically addressable. In one embodiment, the measuring may comprise detecting said optically addressable beads. In one embodiment, the applying said algorithm identifies a dependency matrix between said measured transcripts and said unmeasured transcripts.

In one embodiment, the present invention contemplates a method for making a transcriptome-wide mRNA-expression profiling platform which may comprise a) providing a library of transcriptome-wide mRNA-expression data from a first collection of biological samples; b) performing computational analysis on said library such that a plurality of (orthogonal/non-overlapping) clusters of transcripts are created, wherein the number of said clusters is substantially less than the total number of all transcripts; c) identifying a centroid transcript within each of said transcript clusters; d) identifying a set of transcripts from said transcriptome-wide mRNA-expression-data library whose levels are substantially invariant across said first collection of biological samples; e) providing a device to measure (simultaneously) the levels of at least a portion of said centroid transcripts and said invariant transcripts; f) determining the ability of said measurements of centroid-transcript levels made using said device to represent the levels of other transcripts within its cluster from a second collection of biological samples; and g) repeating steps c) to f) until validated centroid transcripts for each of said plurality of transcript clusters are identified.
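By way of a non-limiting illustration, the repetition of steps c) to f) for a single cluster might be sketched in Python as follows. The function names, the use of a mean Pearson correlation as the acceptance test, and the 0.8 threshold are assumptions made for this example only; the present disclosure does not prescribe a particular acceptance criterion.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def first_validated_centroid(candidates, device_levels, cluster_levels, min_r=0.8):
    """Return the first candidate centroid that passes validation, or None.

    candidates     -- transcript ids for one cluster, ordered by closeness to the cluster mean
    device_levels  -- id -> levels measured with the device in the second sample collection
    cluster_levels -- id -> reference levels of every transcript in the cluster, same samples
    min_r          -- hypothetical acceptance threshold (an assumption, not from the source)
    """
    for cand in candidates:
        others = [t for t in cluster_levels if t != cand]
        rs = [pearson(device_levels[cand], cluster_levels[t]) for t in others]
        if rs and sum(rs) / len(rs) >= min_r:
            return cand          # validated: device measurement tracks the rest of the cluster
    return None                  # every candidate failed; the cluster has no validated centroid
```

In this sketch a failed candidate is simply skipped in favor of the next-closest transcript, which is one possible reading of repeating the steps until a validated centroid transcript is identified.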

In one embodiment, the present invention contemplates a method for using a transcriptome-wide mRNA-expression profiling platform, which may comprise: a) providing: i) a composition of validated centroid transcripts numbering substantially less than the total number of all transcripts; ii) a device capable of measuring the levels of said validated centroid transcripts; iii) an algorithm capable of substantially calculating the levels of transcripts not amongst the set of said validated centroid transcripts from levels of said validated centroid transcripts measured by said device and transcript cluster information created from a library of transcriptome-wide mRNA-expression data from a collection of biological samples; and iv) a biological sample; b) applying said biological sample to said device whereby levels of said validated centroid transcripts in said biological sample are measured; and c) applying said algorithm to said measurements thereby creating a transcriptome-wide mRNA expression profile.

The present invention is also related to compositions and methods for the detection of analytes. Analytes capable of detection by this invention include, but are not limited to, nucleic acids, proteins, peptides, and/or small organic molecules (i.e., for example, inorganic and/or organic). Any particular analyte may be detected and/or identified from a sample containing a plurality of other analytes. Further, the invention provides for a capability of simultaneously detecting and/or identifying all of the plurality of analytes contained within a sample (i.e., for example, a biological sample).

In one embodiment, the present invention contemplates a method, which may comprise: a) providing: i) a sample which may comprise a plurality of analytes; ii) a plurality of solid substrate populations, wherein each of the solid substrate populations comprises a plurality of subsets, and wherein each subset is present in an unequal proportion from every other subset in the same solid substrate population; iii) a plurality of capture probes capable of attaching to said plurality of analytes, wherein each subset may comprise a different capture probe; iv) a means for detecting said plurality of subsets that is capable of creating a multimodal intensity distribution pattern; b) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and c) identifying said plurality of analytes from said multimodal distribution pattern. In one embodiment, the sample may be selected from the group which may comprise a biological sample, a soil sample, or a water sample. In one embodiment, the plurality of analytes may be selected from the group which may comprise nucleic acids, proteins, peptides, drugs, small molecules, biological receptors, enzymes, antibodies, polyclonal antibodies, monoclonal antibodies, or Fab fragments. In one embodiment, the solid substrate population may comprise a bead-set population. In one embodiment, the unequal proportions comprise two subsets in an approximate ratio of 1.25:0.75. In one embodiment, the unequal proportions comprise three subsets in an approximate ratio of 1.25:1.00:0.75. In one embodiment, the unequal proportions comprise four subsets in an approximate ratio of 1.25:1.00:0.75:0.50. In one embodiment, the unequal proportions comprise five subsets in an approximate ratio of 1.50:1.25:1.00:0.75:0.50. In one embodiment, the unequal proportions comprise six subsets in an approximate ratio of 1.75:1.50:1.25:1.00:0.75:0.50. In one embodiment, the unequal proportions comprise seven subsets in an approximate ratio of 2.00:1.75:1.50:1.25:1.00:0.75:0.50. In one embodiment, the unequal proportions comprise eight subsets in an approximate ratio of 2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25. In one embodiment, the unequal proportions comprise nine subsets in an approximate ratio of 2.25:2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25. In one embodiment, the unequal proportions comprise ten subsets in an approximate ratio of 2.5:2.25:2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25.
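For illustration only, the two-subset case (an approximate ratio of 1.25:0.75) might be resolved as sketched below in Python. The sketch assumes that per-bead reporter intensities for one solid substrate population are available as a list of numbers and that the two peaks are well separated; the function names, the simple two-means split, and the subset labels are hypothetical and are not taken from the present disclosure.

```python
from statistics import median

def split_two_peaks(intensities, n_iter=100):
    """Split 1-D intensities into a low and a high group with a two-means iteration."""
    lo, hi = min(intensities), max(intensities)
    low, high = list(intensities), []
    for _ in range(n_iter):
        cut = (lo + hi) / 2                      # boundary halfway between the two group means
        low = [x for x in intensities if x <= cut]
        high = [x for x in intensities if x > cut]
        if not high:                             # all values identical: no second peak
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):         # group means stopped moving
            break
        lo, hi = new_lo, new_hi
    return low, high

def assign_peaks(intensities, major="subset_1.25", minor="subset_0.75"):
    """Attribute the peak containing more beads to the subset mixed at the larger proportion."""
    low, high = split_two_peaks(intensities)
    big, small = (low, high) if len(low) >= len(high) else (high, low)
    return {major: median(big), minor: median(small) if small else None}
```

The design choice here is that bead count, rather than intensity, identifies each subset: because the subsets are pooled in unequal proportions, the more populous peak is attributed to the subset present in the larger proportion regardless of which analyte is more abundant.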

In one embodiment, the present invention contemplates a method, which may comprise: a) providing: i) a solid substrate population which may comprise a first subset and a second subset, wherein the first subset is present in a first proportion and the second subset is present in a second proportion; ii) a first analyte attached to said first subset; iii) a second analyte attached to said second subset; iv) a means for detecting said first subset and second subset that is capable of creating a multimodal intensity distribution pattern; b) detecting said first subset and said second subset with said means, wherein a multimodal intensity distribution pattern is created; and c) identifying said first analyte and said second analyte from said multimodal distribution pattern.

In one embodiment, the solid substrate population may comprise a label. In one embodiment, the label may comprise a mixture of at least two different fluorophores. In one embodiment, the first proportion is different from the second proportion. In one embodiment, the first analyte is attached to the first subset with a first capture probe. In one embodiment, the second analyte is attached to the second subset with a second capture probe. In one embodiment, the multimodal intensity distribution pattern may comprise a first peak corresponding to the first subset. In one embodiment, the multimodal intensity distribution pattern may comprise a second peak corresponding to the second subset.

In one embodiment, the present invention contemplates a method, which may comprise: a) providing: i) a solid substrate population which may comprise a plurality of subsets; ii) a sample which may comprise a plurality of analytes, wherein at least one portion of the plurality of analytes comprise related analytes; and iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern; b) attaching each of the related analyte portions to one of the plurality of subsets; c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and d) identifying said related analytes from said multimodal distribution pattern. In one embodiment, the related analytes comprise linked genes.

In one embodiment, the present invention contemplates a method, which may comprise: a) providing: i) a solid substrate population which may comprise a plurality of subsets; ii) a sample which may comprise a plurality of analytes, wherein at least one portion of the plurality of analytes comprise rare event analytes; and iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern; b) attaching a portion of said plurality of analytes which may contain one or more of the rare event analytes to one of the plurality of subsets; c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and d) determining if said rare event analytes occur in said multimodal distribution pattern. In one embodiment, the rare event analyte portion is present in approximately less than 0.01% of said sample. In one embodiment, the rare event analyte may comprise a small molecule or drug. In one embodiment, the rare event analyte may comprise a nucleic acid mutation. In one embodiment, the rare event analyte may comprise a diseased cell. In one embodiment, the rare event analyte may comprise an autoimmune antibody. In one embodiment, the rare event analyte may comprise a microbe.

In one embodiment, the present invention contemplates a method, which may comprise: a) providing: i) a solid substrate population which may comprise a plurality of subsets; ii) a sample which may comprise a first labeled analyte and a second labeled analyte; and iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern; b) attaching the first and second labeled analytes in an unequal proportion to one of the plurality of subsets; c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and d) identifying said first and second labeled analytes from said multimodal distribution pattern. In one embodiment, the first labeled analyte may comprise a normal cell. In one embodiment, the second labeled analyte may comprise a tumor cell. In one embodiment, the multimodal intensity distribution pattern may comprise a first peak corresponding to the first labeled analyte. In one embodiment, the multimodal intensity distribution pattern may comprise a second peak corresponding to the second labeled analyte. In one embodiment, the unequal proportion is equivalent to a ratio of the first and second peaks.

Accordingly, it is an object of the invention to not encompass within the invention any previously known product, process of making the product, or method of using the product such that Applicants reserve the right and hereby disclose a disclaimer of any previously known product, process, or method. It is further noted that the invention does not intend to encompass within the scope of the invention any product, process, or making of the product or method of using the product, which does not meet the written description and enablement requirements of the USPTO (35 U.S.C. § 112, first paragraph) or the EPO (Article 83 of the EPC), such that Applicants reserve the right and hereby disclose a disclaimer of any previously described product, process of making the product, or method of using the product.

It is noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as “comprises”, “comprised”, “comprising” and the like can have the meaning attributed to it in U.S. Patent law; e.g., they can mean “includes”, “included”, “including”, and the like; and that terms such as “consisting essentially of” and “consists essentially of” have the meaning ascribed to them in U.S. Patent law, e.g., they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the invention.

These and other embodiments are disclosed or are obvious from, and encompassed by, the following Detailed Description.

BRIEF DESCRIPTION OF THE FIGURES

The following detailed description, given by way of example, but not intended to limit the invention solely to the specific embodiments described, may best be understood in conjunction with the accompanying drawings.

The file of this patent contains at least one drawing executed in color. Copies of this patent with color drawings will be provided by the Patent and Trademark Office upon request and payment of the necessary fee.

FIG. 1 presents exemplary simulated data depicting the clustering of PCA loadings of transcripts (purple dots) in the eigenspace by k-means to identify k distinct clusters (gray circles). The transcript closest to the mean of the cluster was selected as the ‘cluster centroid landmark transcript’ (single red dots).

FIG. 2 presents exemplary results using Connectivity Map data demonstrating that approximately 80% of the connections observed between 184 query signatures and gene-expression profiles produced by measuring approximately 22,000 transcripts are recovered using gene-expression profiles created by measuring only approximately 1,000 transcripts and predicting the expression levels of the remainder.

FIG. 3 presents one embodiment of a method for measuring the expression levels of multiple transcripts simultaneously using ligation-mediated amplification and optically-addressed microspheres.

FIG. 4 presents exemplary data for normalized expression levels of a representative cluster centroid landmark transcript (217995_at:SQRDL) in 384 biological samples measured by LMF and Affymetrix microarray.

FIG. 5 presents exemplary data showing a simple (type 1) cluster centroid landmark transcript validation failure; circle. Axes are normalized expression levels.

FIGS. 6A and 6B present exemplary data showing a complex (type 2) cluster centroid landmark transcript validation failure.

FIG. 6A: Plots of normalized expression levels for a representative validated transcript/probe pair (blue, 218039_at:NUSAP1) and a representative failed transcript/probe pair (orange, 217762_s_at:RAB31).

FIG. 6B: Histogram showing normalized expression levels for the validated transcript/probe pair from FIG. 6A (blue arrow) and its associated non-centroid transcripts (blue bars); and the failed transcript/probe pair from FIG. 6A (orange arrow) and its associated non-centroid transcripts (orange bars). Red crosses mark non-correlation of gene-expression levels.

FIG. 7 presents exemplary data comparing the performance of Connectivity Map datasets populated with gene-expression profiles generated with Affymetrix microarrays reporting on approximately 22,000 transcripts (left), and a ligation-mediated amplification and Luminex optically-addressed microsphere assay of 1,000 landmark transcripts with inference of the expression levels of the remaining transcripts (right). Both datasets were queried with an independent HDAC-inhibitor query signature. The ‘bar views’ shown are constructed from 6,100 and 782 horizontal lines, respectively, each representing individual treatment instances and ordered by connectivity score. All instances of the HDAC-inhibitor, vorinostat, are colored in black. Colors applied to the remaining instances reflect their connectivity scores (green, positive; gray, null; red, negative).

FIGS. 8A and 8B present exemplary data comparing consensus clustering dendrograms of gene-expression profiles for human cell lines. FIG. 8A is a clustering dendrogram generated with Affymetrix microarrays. FIG. 8B is a clustering dendrogram generated by a landmark transcript measurement and inference method as contemplated herein. Tissue types are: CO=colon; LE=blood (leukemia); ME=skin (melanoma); CNS=brain (central nervous system); OV=ovary; and RE=kidney (renal).

DETAILED DESCRIPTION OF THE INVENTION

The term “device” as used herein, refers to any composition capable of measuring expression levels of transcripts. For example, a device may comprise a solid planar substrate capable of attaching nucleic acids (i.e., an oligonucleotide microarray). Alternatively, a device may comprise a solution-based bead array, wherein nucleic acids are attached to beads and detected using a flow cytometer. Alternatively, a device may comprise a nucleic-acid sequencer. In other examples, a device may comprise a plurality of cluster centroid landmark transcripts as contemplated by the present invention.

The term “capture probe” as used herein, refers to any molecule capable of attaching and/or binding to a nucleic acid (i.e., for example, a barcode nucleic acid). For example, a capture probe may be an oligonucleotide attached to a bead, wherein the oligonucleotide is at least partially complementary to another oligonucleotide. Alternatively, a capture probe may comprise a polyethylene glycol linker, an antibody, a polyclonal antibody, a monoclonal antibody, an Fab fragment, a biological receptor complex, an enzyme, a hormone, an antigen, and/or a fragment or portion thereof.

The term “LMF” as used herein, refers to an acronym for any method that combines ligation-mediated amplification, optically-addressed and barcoded microspheres, and flow cytometric detection. See Peck et al., “A method for high-throughput gene expression signature analysis” Genome Biol 7:R61 (2006).

The term “transcript” as used herein, refers to any product of DNA transcription, generally characterized as mRNA. Expressed transcripts are recognized as a reliable indicator of gene expression.

The term “gene-expression profile” as used herein, refers to any dataset representing the expression levels of a significant portion of genes within the genome (i.e., for example, a transcriptome).

The term “centroid transcript” as used herein, refers to any transcript that is within the center portion, or is representative of, a transcript cluster. Further, the expression level of a centroid transcript may predict the expression levels of the non-centroid transcripts within the same cluster.

The term “non-centroid transcript” as used herein, refers to any transcript in a transcript cluster that is not a centroid transcript. The expression level of a non-centroid transcript may be predicted (e.g., inferred) by the expression levels of centroid transcripts.

The term “cluster centroid landmark transcript” as used herein, refers to any transcript identified as a centroid transcript, the expression level of which predicts (e.g., infers) the expression levels of non-centroid transcripts within the same cluster, and optionally may contribute to prediction of the expression levels of non-centroid transcripts in other clusters.

The term “computational analysis” as used herein, refers to any mathematical process that results in the identification of transcript clusters, wherein the transcripts are derived from a transcriptome. For example, specific steps in a computational analysis may include, but are not limited to, dimensionality reduction and/or cluster analysis.
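By way of illustration only, the clustering depicted in FIG. 1 (k-means applied to PCA loadings, followed by selection of the transcript closest to each cluster mean) might be sketched in Python as follows. The sketch assumes the NumPy library; the function names, the number of retained components, and the deterministic farthest-first initialization are hypothetical choices made for this example and are not prescribed by the present disclosure.

```python
import numpy as np

def _nearest(points, centers):
    """Index of the nearest center for every point."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def select_centroid_transcripts(expr, n_clusters, n_components=2, n_iter=50):
    """expr: (n_samples, n_transcripts) matrix. Returns (labels, centroid_indices)."""
    # Dimensionality reduction: project each transcript onto the top principal components.
    centered = expr - expr.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = (vt[:n_components] * s[:n_components, None]).T   # (n_transcripts, n_components)

    # Farthest-first initialization (an assumption, chosen here for determinism).
    centers = [loadings[0]]
    while len(centers) < n_clusters:
        d = np.min([np.linalg.norm(loadings - c, axis=1) for c in centers], axis=0)
        centers.append(loadings[d.argmax()])
    centers = np.array(centers)

    # Cluster analysis: plain Lloyd's k-means on the loadings.
    for _ in range(n_iter):
        labels = _nearest(loadings, centers)
        new_centers = np.array([
            loadings[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = _nearest(loadings, centers)

    # Within each non-empty cluster, keep the transcript closest to the cluster mean.
    centroid_indices = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size:
            d = np.linalg.norm(loadings[members] - centers[k], axis=1)
            centroid_indices.append(int(members[d.argmin()]))
    return labels, centroid_indices
```

Clustering the loadings rather than the raw expression columns is what FIG. 1 describes; any other dimensionality reduction or clustering method could be substituted without changing the selection step.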

The term “dependency matrix” as used herein, refers to a table of weights (i.e., factors) relating the expression levels of a plurality of cluster centroid landmark transcripts to the expression levels of non-centroid transcripts generated by a mathematical analysis (i.e., for example, regression) of a library of transcriptome-wide gene-expression profiles. Cluster dependency matrices may be produced from a heterogeneous library of gene-expression profiles or from libraries of gene-expression profiles from specific tissues, organs, or disease classes.

The term “algorithm capable of predicting the level of expression of transcripts” as used herein, refers to any mathematical process that calculates the expression levels of non-centroid transcripts given the expression levels of cluster centroid landmark transcripts and a dependency matrix.
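As a minimal sketch, and assuming that the regression used to generate a dependency matrix is an ordinary least-squares fit with an intercept term (one possibility among many), the construction of a dependency matrix and its application to new landmark measurements might look as follows in Python. The NumPy library and both function names are assumptions of this example.

```python
import numpy as np

def fit_dependency_matrix(landmarks, targets):
    """Least-squares weights relating landmark levels to non-centroid levels.

    landmarks -- (n_profiles, n_landmarks) levels of cluster centroid landmark transcripts
    targets   -- (n_profiles, n_targets) levels of non-centroid transcripts, same profiles
    Returns an (n_landmarks + 1, n_targets) table of weights; the last row is the intercept.
    """
    design = np.hstack([landmarks, np.ones((landmarks.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def infer_expression(weights, measured):
    """Predict non-centroid levels from measured landmark levels and the weight table."""
    design = np.hstack([measured, np.ones((measured.shape[0], 1))])
    return design @ weights
```

In use, the weights would be fitted once on a library of transcriptome-wide profiles and then applied to each new sample in which only the landmark transcripts were measured.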

The term “invariant transcript” as used herein, refers to any transcript that remains at approximately the same level regardless of cell or tissue type, or the presence of a perturbating agent (i.e., for example, a perturbagen). Invariant transcripts, or sets thereof, may be useful as an internal control for normalizing gene-expression data.
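One simple way such an internal control could be applied is sketched below in Python: every level in a sample is rescaled so that the mean of the invariant transcripts equals a fixed target. The function name, the dictionary representation of a sample, and the choice of a mean-based scale factor are assumptions made for illustration; other normalization schemes are equally possible.

```python
def normalize_to_invariants(sample, invariant_ids, target=1.0):
    """Rescale a sample so the mean level of its invariant transcripts equals `target`.

    sample        -- dict mapping transcript id -> raw measured level
    invariant_ids -- ids of the transcripts treated as invariant (hypothetical input)
    """
    reference = [sample[t] for t in invariant_ids]
    mean_ref = sum(reference) / len(reference)
    if mean_ref == 0:
        raise ValueError("invariant transcripts have zero mean level")
    scale = target / mean_ref
    return {t: level * scale for t, level in sample.items()}
```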

The term “moderate-multiplex assay platform” as used herein, refers to any technology capable of producing simultaneous measurements of the expression levels of a fraction of the transcripts in a transcriptome (i.e., for example, more than approximately 10 and less than approximately 2,000).

The term “Connectivity Map” as used herein, refers to a public database of transcriptome-wide gene-expression profiles derived from cultured human cells treated with a plurality of perturbagens, and pattern-matching algorithms for the scoring and identification of significant similarities between those profiles and external gene-expression data, as described by Lamb et al., “The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease”. Science 313:1929 (2006). Build02 of the Connectivity Map contains 7,056 full-transcriptome gene-expression profiles generated with Affymetrix high-density oligonucleotide microarrays representing the biological effects of 1,309 small-molecule perturbagens, and is available at broadinstitute.org/cmap.

The term “query signature” as used herein, refers to any set of up- and down-regulated genes between two cellular states (e.g., cells treated with a small molecule versus cells treated with the vehicle in which the small molecule is dissolved) derived from a gene-expression profile that is suitable to query Connectivity Map. For example, a ‘query signature’ may comprise a list of genes differentially expressed in a distinction of interest (e.g., disease versus normal), as opposed to an ‘expression profile’ that illustrates all genes with their respective expression levels.
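For illustration, a query signature of this kind might be derived from two expression profiles as sketched below in Python. The sketch assumes log-scale expression levels held in dictionaries keyed by gene identifier and ranks genes by a simple difference between the two states; the function name and the cutoff of n genes per direction are hypothetical.

```python
def query_signature(treated, control, n=2):
    """Return (up, down): the n most up- and n most down-regulated genes.

    treated, control -- dicts mapping gene id -> log-scale expression level
    Genes whose level does not change in the given direction are never reported.
    """
    diffs = {g: treated[g] - control[g] for g in treated if g in control}
    ranked = sorted(diffs, key=diffs.get, reverse=True)       # most up-regulated first
    up = [g for g in ranked[:n] if diffs[g] > 0]
    down = [g for g in ranked[::-1][:n] if diffs[g] < 0]      # most down-regulated first
    return up, down
```

Unlike an expression profile, the result carries only gene identifiers and their direction of change, which matches the distinction drawn in the definition above.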

The term “connectivity score” as used herein, refers to a relative measure of the similarity of the biological effects of a perturbagen used to generate a query signature with those of a perturbagen represented in the Connectivity Map based upon the gene-expression profile of a single treatment with that perturbagen. For example, one would expect every treatment instance with vorinostat, a known histone deacetylase (HDAC) inhibitor, to have a high connectivity score with a query signature generated from the effects of treatments with a panel of HDAC inhibitors.

The term “enrichment score” as used herein, refers to a measure of the similarity of the biological effects of a perturbagen used to generate a query signature with those of a perturbagen represented in the Connectivity Map based upon the gene-expression profiles of multiple independent treatments with that perturbagen.

The term “template” as used herein, refers to any stable nucleic acid structure that represents at least a portion of a cluster centroid landmark gene transcript nucleic acid sequence. The template may serve to allow the generation of a complementary nucleic acid sequence.

The term “derived from” as used herein, refers to the source of a biological sample, wherein the sample may comprise a nucleic acid sequence. In one respect, a sample or sequence may be derived from an organism or particular species. In another respect, a sample or sequence may be derived from (i.e., for example, a smaller portion and/or fragment) a larger composition or sequence.

The term, “purified” or “isolated”, as used herein, may refer to a component of a composition that has been subjected to treatment (i.e., for example, fractionation) to remove various other components. Where the term “substantially purified” is used, this designation will refer to a composition in which a nucleic acid sequence forms the major component of the composition, such as constituting about 50%, about 60%, about 70%, about 80%, about 90%, about 95% or more of the composition (i.e., for example, weight/weight and/or weight/volume). The term “purified to homogeneity” is used to include compositions that have been purified to “apparent homogeneity” such that there is a single nucleic acid species (i.e., for example, based upon SDS-PAGE or HPLC analysis). A purified composition is not intended to mean that all trace impurities have been removed; some trace impurities may remain.

As used herein, the term “substantially purified” refers to molecules, such as nucleic acid sequences, that are removed from their natural environment, isolated or separated, and are at least 60% free, preferably 75% free, and more preferably 90% free from other components with which they are naturally associated. An “isolated polynucleotide” is therefore a substantially purified polynucleotide.

“Nucleic acid sequence” and “nucleotide sequence” as used herein refer to an oligonucleotide or polynucleotide, and fragments or portions thereof, and to DNA or RNA of genomic or synthetic origin which may be single- or double-stranded, and represent the sense or antisense strand.

The term “an isolated nucleic acid”, as used herein, refers to any nucleic acid molecule that has been removed from its natural state (e.g., removed from a cell and is, in a preferred embodiment, free of other genomic nucleic acid).

The term “portion or fragment” when used in reference to a nucleotide sequence refers to smaller subsets of that nucleotide sequence. For example, such portions or fragments may range in size from 5 nucleotide residues to the entire nucleotide sequence minus one nucleic acid residue.

The term “small organic molecule” as used herein, refers to any molecule of a size comparable to those organic molecules generally used in pharmaceuticals. The term excludes biological macromolecules (e.g., proteins, nucleic acids, etc.). Preferred small organic molecules range in size from approximately 10 Da up to about 5000 Da, more preferably up to 2000 Da, and most preferably up to about 1000 Da.

The term “sample” as used herein is used in its broadest sense and includes environmental and biological samples. Environmental samples include material from the environment such as soil and water. Biological samples may be animal, including human, fluid (e.g., blood, plasma and serum), solid (e.g., stool), tissue, liquid foods (e.g., milk), and solid foods (e.g., vegetables). For example, a pulmonary sample may be collected by bronchoalveolar lavage (BAL) which may comprise fluid and cells derived from lung tissues. A biological sample may comprise a cell, tissue extract, body fluid, chromosomes or extrachromosomal elements isolated from a cell, genomic DNA (in solution or bound to a solid support such as for Southern blot analysis), RNA (in solution or bound to a solid support such as for Northern blot analysis), cDNA (in solution or bound to a solid support) and the like.

The term “functionally equivalent codon”, as used herein, refers to different codons that encode the same amino acid. This phenomenon is often referred to as “degeneracy” of the genetic code. For example, six different codons encode the amino acid arginine.
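The degeneracy described above can be illustrated with a short sketch. It assumes the standard genetic code written in DNA letters, built from the conventional T, C, A, G ordering; the names `CODON_TABLE` and `equivalent_codons` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product

# Standard genetic code (DNA alphabet), listed in the conventional
# T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def equivalent_codons(amino_acid: str) -> list:
    """Return every codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def functionally_equivalent(codon_a: str, codon_b: str) -> bool:
    """True when two different codons encode the same amino acid."""
    a, b = codon_a.upper(), codon_b.upper()
    return a != b and CODON_TABLE[a] == CODON_TABLE[b]


print(equivalent_codons("R"))  # the arginine codons
```

Under this table, arginine (`"R"`) is encoded by six codons, which matches the example given in the definition.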

A “variant” of a nucleotide is defined as a novel nucleotide sequence which differs from a reference oligonucleotide by having deletions, insertions and substitutions. These may be detected using a variety of methods (e.g., sequencing, hybridization assays etc.).

A “deletion” is defined as a change in a nucleotide sequence in which one or more nucleotides are absent relative to the native sequence.

An “insertion” or “addition” is that change in a nucleotide sequence which has resulted in the addition of one or more nucleotides relative to the native sequence. A “substitution” results from the replacement of one or more nucleotides by different nucleotides or amino acids, respectively, and may be the same length as the native sequence but have a different sequence.

The term “derivative” as used herein, refers to any chemical modification of a nucleic acid. Illustrative of such modifications would be replacement of hydrogen by an alkyl, acyl, or amino group. For example, a nucleic acid derivative would encode a polypeptide which retains essential biological characteristics.

As used herein, the terms “complementary” or “complementarity” are used in reference to “polynucleotides” and “oligonucleotides” (which are interchangeable terms that refer to a sequence of nucleotides) related by the base-pairing rules. For example, the sequence “C-A-G-T,” is complementary to the sequence “G-T-C-A.” Complementarity may be “partial” or “total.” “Partial” complementarity is where one or more nucleic acid bases are not matched according to the base pairing rules. “Total” or “complete” complementarity between nucleic acids is where each and every nucleic acid base is matched with another base under the base pairing rules. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods which depend upon binding between nucleic acids.
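The distinction between “partial” and “total” complementarity can be sketched as follows. This is a minimal illustration that assumes two DNA strands of equal length compared position by position, as in the “C-A-G-T” and “G-T-C-A” example; the function names are hypothetical and gaps or mismatched lengths are not handled.

```python
# Watson-Crick base-pairing rules for DNA.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_fraction(strand_a: str, strand_b: str) -> float:
    """Fraction of aligned positions at which the two strands base-pair."""
    if not strand_a or len(strand_a) != len(strand_b):
        raise ValueError("strands must be non-empty and of equal length")
    matches = sum(
        PAIRS.get(a) == b for a, b in zip(strand_a.upper(), strand_b.upper())
    )
    return matches / len(strand_a)


def classify_complementarity(strand_a: str, strand_b: str) -> str:
    """Label a pair of strands as 'total', 'partial', or 'none'."""
    fraction = complement_fraction(strand_a, strand_b)
    if fraction == 1.0:
        return "total"
    return "partial" if fraction > 0 else "none"


print(classify_complementarity("CAGT", "GTCA"))
```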

The terms “homology” and “homologous” as used herein in reference to nucleotide sequences refer to a degree of complementarity with other nucleotide sequences. There may be partial homology or complete homology (i.e., identity). A nucleotide sequence which is partially complementary, i.e., “substantially homologous,” to a nucleic acid sequence is one that at least partially inhibits a completely complementary sequence from hybridizing to a target nucleic acid sequence. The inhibition of hybridization of the completely complementary sequence to the target sequence may be examined using a hybridization assay (Southern or Northern blot, solution hybridization and the like) under conditions of low stringency. A substantially homologous sequence or probe will compete for and inhibit the binding (i.e., the hybridization) of a completely homologous sequence to a target sequence under conditions of low stringency. This is not to say that conditions of low stringency are such that non-specific binding is permitted; low stringency conditions require that the binding of two sequences to one another be a specific (i.e., selective) interaction. The absence of non-specific binding may be tested by the use of a second target sequence which lacks even a partial degree of complementarity (e.g., less than about 30% identity); in the absence of non-specific binding the probe will not hybridize to the second non-complementary target.

The terms “homology” and “homologous” as used herein in reference to amino acid sequences refer to the degree of identity of the primary structure between two amino acid sequences. Such a degree of identity may be directed to a portion of each amino acid sequence, or to the entire length of the amino acid sequence. Two or more amino acid sequences that are “substantially homologous” may have at least 50% identity, preferably at least 75% identity, more preferably at least 85% identity, most preferably at least 95%, or 100% identity.

An oligonucleotide sequence which is a “homolog” is defined herein as an oligonucleotide sequence which exhibits greater than or equal to 50% identity to a sequence, when sequences having a length of 100 bp or larger are compared.

Low stringency conditions comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH2PO4.H2O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.1% SDS, 5× Denhardt's reagent {50× Denhardt's contains per 500 ml: 5 g Ficoll (Type 400, Pharmacia), 5 g BSA (Fraction V; Sigma)} and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution which may comprise 5×SSPE, 0.1% SDS at 42° C. when a probe of about 500 nucleotides in length is employed. Numerous equivalent conditions may also be employed to comprise low stringency conditions; factors such as the length and nature (DNA, RNA, base composition) of the probe and nature of the target (DNA, RNA, base composition, present in solution or immobilized, etc.) and the concentration of the salts and other components (e.g., the presence or absence of formamide, dextran sulfate, polyethylene glycol), as well as components of the hybridization solution may be varied to generate conditions of low stringency hybridization different from, but equivalent to, the above listed conditions. In addition, conditions which promote hybridization under conditions of high stringency (e.g., increasing the temperature of the hybridization and/or wash steps, the use of formamide in the hybridization solution, etc.) may also be used.

As used herein, the term “hybridization” is used in reference to the pairing of complementary nucleic acids using any process by which a strand of nucleic acid joins with a complementary strand through base pairing to form a hybridization complex. Hybridization and the strength of hybridization (i.e., the strength of the association between the nucleic acids) are impacted by such factors as the degree of complementarity between the nucleic acids, stringency of the conditions involved, the Tm of the formed hybrid, and the G:C ratio within the nucleic acids.

As used herein the term “hybridization complex” refers to a complex formed between two nucleic acid sequences by virtue of the formation of hydrogen bonds between complementary G and C bases and between complementary A and T bases; these hydrogen bonds may be further stabilized by base stacking interactions. The two complementary nucleic acid sequences hydrogen bond in an antiparallel configuration. A hybridization complex may be formed in solution (e.g., C0t or R0t analysis) or between one nucleic acid sequence present in solution and another nucleic acid sequence immobilized to a solid support (e.g., a nylon membrane or a nitrocellulose filter as employed in Southern and Northern blotting, dot blotting or a glass slide as employed in in situ hybridization, including FISH (fluorescent in situ hybridization)).

As used herein, the term “Tm” is used in reference to the “melting temperature.” The melting temperature is the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. As indicated by standard references, a simple estimate of the Tm value may be calculated by the equation: Tm = 81.5 + 0.41 (% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl. Anderson et al., “Quantitative Filter Hybridization” In: Nucleic Acid Hybridization (1985). More sophisticated computations take structural, as well as sequence characteristics, into account for the calculation of Tm.
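As a worked illustration of the simple estimate above, the following sketch computes Tm = 81.5 + 0.41 (% G+C) for a DNA sequence. The function name `estimate_tm` is hypothetical, and the result is only the rough estimate described for a nucleic acid in aqueous solution at 1 M NaCl; it ignores length and structural effects.

```python
def estimate_tm(sequence: str) -> float:
    """Estimate the melting temperature (deg C) as 81.5 + 0.41 * (%G+C)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    gc_count = sum(base in "GC" for base in seq)
    percent_gc = 100.0 * gc_count / len(seq)
    return 81.5 + 0.41 * percent_gc


# A sequence with 50% G+C content: 81.5 + 0.41 * 50
print(estimate_tm("ATGC"))
```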

As used herein the term “stringency” is used in reference to the conditions of temperature, ionic strength, and the presence of other compounds such as organic solvents, under which nucleic acid hybridizations are conducted. “Stringency” typically occurs in a range from about Tm to about 20° C. to 25° C. below Tm. A “stringent hybridization” may be used to identify or detect identical polynucleotide sequences or to identify or detect similar or related polynucleotide sequences. For example, when fragments of SEQ ID NO: 2 are employed in hybridization reactions under stringent conditions the hybridization of fragments of SEQ ID NO: 2 which contain unique sequences (i.e., regions which are either non-homologous to or which contain less than about 50% homology or complementarity with SEQ ID NO: 2) are favored. Alternatively, when conditions of “weak” or “low” stringency are used hybridization may occur with nucleic acids that are derived from organisms that are genetically diverse (i.e., for example, the frequency of complementary sequences is usually low between such organisms).

As used herein, the term “amplifiable nucleic acid” is used in reference to nucleic acids which may be amplified by any amplification method. It is contemplated that “amplifiable nucleic acid” will usually comprise “sample template.”

As used herein, the term “sample template” refers to nucleic acid originating from a sample which is analyzed for the presence of a target sequence of interest. In contrast, “background template” is used in reference to nucleic acid other than sample template which may or may not be present in a sample. Background template is most often inadvertent. It may be the result of carryover, or it may be due to the presence of nucleic acid contaminants sought to be purified away from the sample. For example, nucleic acids from organisms other than those to be detected may be present as background in a test sample.

“Amplification” is defined as the production of additional copies of a nucleic acid sequence and is generally carried out using polymerase chain reaction. Dieffenbach C. W. and G. S. Dveksler (1995) In: PCR Primer, a Laboratory Manual, Cold Spring Harbor Press, Plainview, N.Y.

As used herein, the term “polymerase chain reaction” (“PCR”) refers to the method of K. B. Mullis U.S. Pat. Nos. 4,683,195 and 4,683,202, herein incorporated by reference, which describe a method for increasing the concentration of a segment of a target sequence in a mixture of genomic DNA without cloning or purification. The length of the amplified segment of the desired target sequence is determined by the relative positions of two oligonucleotide primers with respect to each other, and therefore, this length is a controllable parameter. By virtue of the repeating aspect of the process, the method is referred to as the “polymerase chain reaction” (hereinafter “PCR”). Because the desired amplified segments of the target sequence become the predominant sequences (in terms of concentration) in the mixture, they are said to be “PCR amplified”. With PCR, it is possible to amplify a single copy of a specific target sequence in genomic DNA to a level detectable by several different methodologies (e.g., hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; incorporation of 32P-labeled deoxynucleotide triphosphates, such as dCTP or dATP, into the amplified segment). In addition to genomic DNA, any oligonucleotide sequence may be amplified with the appropriate set of primer molecules. In particular, the amplified segments created by the PCR process itself are, themselves, efficient templates for subsequent PCR amplifications.

As used herein, the term “primer” refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, which is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product which is complementary to a nucleic acid strand is induced (i.e., in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer and the use of the method.

As used herein, the term “probe” refers to an oligonucleotide (i.e., a sequence of nucleotides), whether occurring naturally as in a purified restriction digest or produced synthetically, recombinantly or by PCR amplification, which is capable of hybridizing to another oligonucleotide of interest. A probe may be single-stranded or double-stranded. Probes are useful in the detection, identification and isolation of particular gene sequences. It is contemplated that any probe used in the present invention will be labeled with any “reporter molecule,” so that it is detectable in any detection system, including, but not limited to enzyme (e.g., ELISA, as well as enzyme-based histochemical assays), fluorescent, radioactive, and luminescent systems. It is not intended that the present invention be limited to any particular detection system or label.

As used herein, the terms “restriction endonucleases” and “restriction enzymes” refer to bacterial enzymes, each of which cut double-stranded DNA at or near a specific nucleotide sequence.

DNA molecules are said to have “5′ ends” and “3′ ends” because mononucleotides are reacted to make oligonucleotides in a manner such that the 5′ phosphate of one mononucleotide pentose ring is attached to the 3′ oxygen of its neighbor in one direction via a phosphodiester linkage. Therefore, an end of an oligonucleotide is referred to as the “5′ end” if its 5′ phosphate is not linked to the 3′ oxygen of a mononucleotide pentose ring. An end of an oligonucleotide is referred to as the “3′ end” if its 3′ oxygen is not linked to a 5′ phosphate of another mononucleotide pentose ring. As used herein, a nucleic acid sequence, even if internal to a larger oligonucleotide, also may be said to have 5′ and 3′ ends. In either a linear or circular DNA molecule, discrete elements are referred to as being “upstream” or 5′ of the “downstream” or 3′ elements. This terminology reflects the fact that transcription proceeds in a 5′ to 3′ fashion along the DNA strand. The promoter and enhancer elements which direct transcription of a linked gene are generally located 5′ or upstream of the coding region. However, enhancer elements may exert their effect even when located 3′ of the promoter element and the coding region. Transcription termination and polyadenylation signals are located 3′ or downstream of the coding region.

As used herein, the term “an oligonucleotide having a nucleotide sequence encoding a gene” means a nucleic acid sequence which may comprise the coding region of a gene, i.e. the nucleic acid sequence which encodes a gene product. The coding region may be present in a cDNA, genomic DNA or RNA form. When present in a DNA form, the oligonucleotide may be single-stranded (i.e., the sense strand) or double-stranded. Suitable control elements such as enhancers/promoters, splice junctions, polyadenylation signals, etc. may be placed in close proximity to the coding region of the gene if needed to permit proper initiation of transcription and/or correct processing of the primary RNA transcript. Alternatively, the coding region utilized in the expression vectors of the present invention may contain endogenous enhancers/promoters, splice junctions, intervening sequences, polyadenylation signals, etc. or a combination of both endogenous and exogenous control elements.

The term “poly A site” or “poly A sequence” as used herein denotes a DNA sequence which directs both the termination and polyadenylation of the nascent RNA transcript. Efficient polyadenylation of the recombinant transcript is desirable as transcripts lacking a poly A tail are unstable and are rapidly degraded. The poly A signal utilized in an expression vector may be “heterologous” or “endogenous.” An endogenous poly A signal is one that is found naturally at the 3′ end of the coding region of a given gene in the genome. A heterologous poly A signal is one which is isolated from one gene and placed 3′ of another gene. Efficient expression of recombinant DNA sequences in eukaryotic cells involves expression of signals directing the efficient termination and polyadenylation of the resulting transcript. Transcription termination signals are generally found downstream of the polyadenylation signal and are a few hundred nucleotides in length.

As used herein, the terms “nucleic acid molecule encoding”, “DNA sequence encoding,” and “DNA encoding” refer to the order or sequence of deoxyribonucleotides along a strand of deoxyribonucleic acid. The order of these deoxyribonucleotides determines the order of amino acids along the polypeptide (protein) chain. The DNA sequence thus codes for the amino acid sequence.

The term “Southern blot” refers to the analysis of DNA on agarose or acrylamide gels to fractionate the DNA according to size, followed by transfer and immobilization of the DNA from the gel to a solid support, such as nitrocellulose or a nylon membrane. The immobilized DNA is then probed with a labeled oligodeoxyribonucleotide probe or DNA probe to detect DNA species complementary to the probe used. The DNA may be cleaved with restriction enzymes prior to electrophoresis. Following electrophoresis, the DNA may be partially depurinated and denatured prior to or during transfer to the solid support. Southern blots are a standard tool of molecular biologists. J. Sambrook et al. (1989) In: Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Press, N.Y., pp 9.31-9.58.

The term “Northern blot” as used herein refers to the analysis of RNA by electrophoresis of RNA on agarose gels to fractionate the RNA according to size followed by transfer of the RNA from the gel to a solid support, such as nitrocellulose or a nylon membrane. The immobilized RNA is then probed with a labeled oligodeoxyribonucleotide probe or DNA probe to detect RNA species complementary to the probe used. Northern blots are a standard tool of molecular biologists. J. Sambrook et al. (1989) supra, pp 7.39-7.52.

The term “reverse Northern blot” as used herein refers to the analysis of DNA by electrophoresis of DNA on agarose gels to fractionate the DNA on the basis of size followed by transfer of the fractionated DNA from the gel to a solid support, such as nitrocellulose or a nylon membrane. The immobilized DNA is then probed with a labeled oligoribonucleotide probe or RNA probe to detect DNA species complementary to the ribo probe used.

As used herein the term “coding region” when used in reference to a structural gene refers to the nucleotide sequences which encode the amino acids found in the nascent polypeptide as a result of translation of a mRNA molecule. The coding region is bounded, in eukaryotes, on the 5′ side by the nucleotide triplet “ATG” which encodes the initiator methionine and on the 3′ side by one of the three triplets which specify stop codons (i.e., TAA, TAG, TGA).
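The boundaries described above can be illustrated with a deliberately simplified sketch. It assumes an intron-free DNA sequence (for example, a cDNA), starts at the first “ATG” encountered, and reads in-frame triplets until one of the three stop codons; the function name `find_coding_region` is hypothetical, and real gene annotation involves considerably more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_region(dna: str):
    """Return the span from the first ATG through the first in-frame stop
    codon, or None if no such span exists."""
    seq = dna.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time, in frame with the ATG.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None


print(find_coding_region("CCATGAAATGGTAGCC"))
```

In this example the out-of-frame “TGA” that overlaps the start codon is ignored, because only triplets in frame with the “ATG” are examined.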

As used herein, the term “structural gene” refers to a DNA sequence coding for RNA or a protein. In contrast, “regulatory genes” are structural genes which encode products which control the expression of other genes (e.g., transcription factors).

As used herein, the term “gene” means the deoxyribonucleotide sequences which may comprise the coding region of a structural gene and including sequences located adjacent to the coding region on both the 5′ and 3′ ends for a distance of about 1 kb on either end such that the gene corresponds to the length of the full-length mRNA. The sequences which are located 5′ of the coding region and which are present on the mRNA are referred to as 5′ non-translated sequences. The sequences which are located 3′ or downstream of the coding region and which are present on the mRNA are referred to as 3′ non-translated sequences. The term “gene” encompasses both cDNA and genomic forms of a gene. A genomic form or clone of a gene contains the coding region interrupted with non-coding sequences termed “introns” or “intervening regions” or “intervening sequences.” Introns are segments of a gene which are transcribed into heterogeneous nuclear RNA (hnRNA); introns may contain regulatory elements such as enhancers. Introns are removed or “spliced out” from the nuclear or primary transcript; introns therefore are absent in the messenger RNA (mRNA) transcript. The mRNA functions during translation to specify the sequence or order of amino acids in a nascent polypeptide.

In addition to containing introns, genomic forms of a gene may also include sequences located on both the 5′ and 3′ end of the sequences which are present on the RNA transcript. These sequences are referred to as “flanking” sequences or regions (these flanking sequences are located 5′ or 3′ to the non-translated sequences present on the mRNA transcript). The 5′ flanking region may contain regulatory sequences such as promoters and enhancers which control or influence the transcription of the gene. The 3′ flanking region may contain sequences which direct the termination of transcription, posttranscriptional cleavage and polyadenylation.

The term “label” or “detectable label” is used herein, to refer to any composition detectable by spectroscopic, photochemical, biochemical, immunochemical, electrical, optical or chemical means. Such labels include biotin for staining with labeled streptavidin conjugate, magnetic beads (e.g., Dynabeads®), fluorescent dyes (e.g., fluorescein, Texas red, rhodamine, green fluorescent protein, and the like), radiolabels (e.g., 3H, 125I, 35S, 14C, or 32P), enzymes (e.g., horseradish peroxidase, alkaline phosphatase and others commonly used in an ELISA), and colorimetric labels such as colloidal gold or colored glass or plastic (e.g., polystyrene, polypropylene, latex, etc.) beads. Patents teaching the use of such labels include, but are not limited to, U.S. Pat. Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and 4,366,241 (all herein incorporated by reference). The labels contemplated in the present invention may be detected by many methods. For example, radiolabels may be detected using photographic film or scintillation counters, and fluorescent markers may be detected using a photodetector to detect emitted light. Enzymatic labels are typically detected by providing the enzyme with a substrate and detecting the reaction product produced by the action of the enzyme on the substrate, and colorimetric labels are detected by simply visualizing the colored label.

The present invention is related to the field of genomic informatics and gene-expression profiling. Gene-expression profiles provide complex molecular fingerprints regarding the relative state of a cell or tissue. Similarities in gene-expression profiles between organic states (i.e., for example, normal and diseased cells and/or tissues) provide molecular taxonomies, classification, and diagnostics. Similarities in gene-expression profiles resulting from various external perturbations (i.e., for example, ablation or enforced expression of specific genes, and/or small molecules, and/or environmental changes) reveal functional similarities between these perturbagens, of value in pathway and mechanism-of-action elucidation. Similarities in gene-expression profiles between organic (e.g. disease) and induced (e.g. by small molecule) states may identify clinically-effective therapies. Improvements described herein allow for the efficient and economical generation of full-transcriptome gene-expression profiles by identifying cluster centroid landmark transcripts that predict the expression levels of other transcripts within the same cluster.

Some embodiments of the present invention contemplate performing genome-wide transcriptional profiling for applications including, but not limited to, disease classification and diagnosis without resort to expensive and laborious microarray technology (i.e., for example, Affymetrix GeneChip microarrays). Other uses include, but are not limited to, generating gene-expression data for use in and with information databases (i.e., for example, connectivity maps). A connectivity map typically may comprise a collection of a large number of gene-expression profiles together with allied pattern-matching software. The collection of profiles is searched with the pattern-matching algorithm for profiles that are similar to gene-expression data derived from a biological state of interest. The utility of this searching and pattern-matching exercise resides in the belief that similar biological states may be identified through the transitory feature of common gene-expression changes. The gene-expression profiles in a connectivity map may be derived from known cellular states, or cells or tissues treated with known chemical or genetic perturbagens. In this mode, the connectivity map is a tool for the functional annotation of the biological state of interest. Alternatively, the connectivity map is populated with gene-expression profiles from cells or tissues treated with previously uncharacterized or novel perturbagens. In this mode, the connectivity map functions as a screening tool. Most often, a connectivity map is populated with profiles of both types. Connectivity maps, in general, establish biologically-relevant connections between disease states, gene-product function, and small-molecule action.
In particular, connectivity maps have wide-ranging applications including, but not limited to, functional annotation of unknown genes and biological states, identification of the mode of action or functional class of a small molecule, and the identification of perturbagens that modulate or reverse a disease state towards therapeutic advantage as potential drugs. See Lamb et al, “The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease” Science 313: 1929-1935 (2006), and Lamb, “The Connectivity Map: a new tool for biomedical research” Nature Reviews Cancer 7: 54-60 (2007). However, the high cost of generating gene-expression profiles severely limits the size and scope of connectivity maps. A connectivity map populated with gene-expression profiles derived from every member of an industrial small-molecule drug-screening library, a saturated combinatorial or diversity-orientated chemical library, a comprehensive collection of crude or purified plant or animal extracts, or from the genetic ablation or forced expression of every gene in a mammalian genome, for example, would be expected to facilitate more, and more profound, biological discoveries than those of existing connectivity maps. Although it is not necessary to understand the mechanism of an invention, it is believed that the presently disclosed method for gene-expression profiling reduces the cost of generating these profiles by more than 30-fold. The present invention contemplates the creation of connectivity maps with at least 100,000 gene-expression profiles, and ultimately, many millions of gene-expression profiles.

The present invention contemplates compositions and methods for making and using a transcriptome-wide gene-expression profiling platform that measures the expression levels of only a select subset of the total number of transcripts. Because gene expression is believed to be highly correlated, direct measurement of a small number (for example, 1,000) of appropriately-selected “landmark” transcripts allows the expression levels of the remainder to be inferred. The present invention, therefore, has the potential to reduce the cost and increase the throughput of full-transcriptome gene-expression profiling relative to the well-known conventional approaches that require all transcripts to be measured.
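The idea of inferring an unmeasured transcript from a measured landmark can be sketched as follows. This passage does not specify the inference model, so the example assumes, purely for illustration, a simple least-squares line relating one landmark transcript to one unmeasured transcript, fitted on a reference collection in which both were measured. The function names and the toy values are hypothetical.

```python
def fit_line(landmark_levels, target_levels):
    """Least-squares fit of target = slope * landmark + intercept."""
    n = len(landmark_levels)
    if n < 2 or n != len(target_levels):
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(landmark_levels) / n
    mean_y = sum(target_levels) / n
    sxx = sum((x - mean_x) ** 2 for x in landmark_levels)
    if sxx == 0:
        raise ValueError("landmark levels must vary across profiles")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(landmark_levels, target_levels)
    )
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def infer_level(landmark_level, model):
    """Infer the unmeasured transcript's level from a new landmark reading."""
    slope, intercept = model
    return slope * landmark_level + intercept


# Toy reference profiles in which both transcripts were measured.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(infer_level(5.0, model))
```

In practice a model of this kind would presumably be fitted per unmeasured transcript and could draw on many landmarks at once; the single-predictor form is used here only to keep the sketch short.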

In one embodiment, the present invention contemplates identifying landmark transcripts from a computational analysis of a large collection of transcriptome-wide gene-expression profiles. In one embodiment, the profiles contain identities and expression levels of a large proportion (preferably more than 70%) of the known transcripts in the genome. In one preferred embodiment, the profiles are generated by the use of high-density DNA microarrays commercially-available from, but not limited to, Affymetrix, Agilent, and Illumina. Suitable profiles may also be generated by other transcriptome-analysis methods including, but not limited to, Serial Analysis of Gene Expression (SAGE) and deep cDNA sequencing. In one preferred embodiment, all profiles are generated with the same analysis method. In one especially preferred embodiment, all profiles are generated using Affymetrix oligonucleotide microarrays. In one embodiment, the number of profiles in the collection exceeds 1,000, and preferably is more than 10,000. In one preferred embodiment, the profiles derive from a broad diversity of normal and diseased tissue and/or cell types. As known to those skilled in the art, collections of suitable gene-expression profiles are available from public and private, commercial sources. In one preferred embodiment, gene-expression profiles are obtained from NCBI's Gene Expression Omnibus (GEO). In one embodiment, expression levels in the profiles in the collection are scaled relative to each other. Those skilled in the art will be aware of a variety of methods to achieve such normalization, including, but not limited to, quantile normalization (preferably RMA).
In one preferred embodiment, expression levels in the profiles in the collection are scaled relative to each other using a set of transcripts (numbering approximately 100, and preferably approximately 350) having the lowest coefficients-of-variation (CV) of all transcripts at each of a number (preferably approximately 14) of expression levels chosen to span the range of expression levels observed, from an independent collection of transcriptome-wide gene-expression profiles (numbering at least 1,000 and preferably approximately 7,000).
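
By way of non-limiting illustration only, the quantile-normalization step named above may be sketched as follows. The function name `quantile_normalize` and the toy values are assumptions made for this example and do not appear elsewhere in this specification. The sketch covers only the quantile step (RMA additionally involves background correction and probe summarization) and assumes no tied values within a profile.

```python
def quantile_normalize(profiles):
    """Force every profile onto a shared distribution of expression levels.

    profiles: list of equal-length lists, one list of transcript levels per profile.
    Returns new lists in which the value at rank r of every profile is replaced by
    the mean, across profiles, of the values at rank r.
    """
    n_transcripts = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Mean of the r-th smallest value across all profiles.
    rank_means = [
        sum(sp[r] for sp in sorted_profiles) / len(profiles)
        for r in range(n_transcripts)
    ]
    normalized = []
    for p in profiles:
        # order[r] is the index of the transcript holding rank r in this profile.
        order = sorted(range(n_transcripts), key=lambda i: p[i])
        out = [0.0] * n_transcripts
        for r, i in enumerate(order):
            out[i] = rank_means[r]
        normalized.append(out)
    return normalized


example = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step every profile contains the same set of values, so differences between profiles reflect only the rank order of transcripts.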

In one preferred embodiment, profiles used to identify landmark transcripts are required to exceed a minimum standard for data quality (i.e., for example, quality control (QC) analysis). The samples passing the QC analysis are identified as a core dataset. Suitable data-quality measures are known to those skilled in the art and include, but are not limited to, percentage-of-P-cells and 3′-to-5′ ratios. In one embodiment, an empirical distribution of data-quality measures is built and outlier profiles eliminated from the collection. In one preferred embodiment, profiles with data-quality measures beyond the 95th percentile of the distribution are eliminated from the collection. In one preferred embodiment, the set of transcripts represented in all profiles in the collection is identified, and the remainder eliminated from all of the profiles. In one embodiment, the set of transcripts below the limit of detection in a large proportion of the profiles (preferably 99%) are eliminated from the profiles.
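
By way of non-limiting illustration only, the percentile-based elimination of outlier profiles described above may be sketched as follows. The names `percentile` and `core_dataset` are hypothetical. The sketch assumes a single data-quality measure for which larger values indicate poorer quality, and uses a nearest-rank percentile; other percentile definitions could equally be used.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def core_dataset(quality_by_profile, pct=95):
    """Return the profile identifiers whose quality measure does not exceed
    the chosen percentile of the empirical distribution."""
    cutoff = percentile(list(quality_by_profile.values()), pct)
    return sorted(p for p, q in quality_by_profile.items() if q <= cutoff)


# Toy data: twenty profiles with quality measures 1.0 through 20.0.
quality = {f"profile_{i:02d}": float(i) for i in range(1, 21)}
core = core_dataset(quality)
```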

In one embodiment, the present invention contemplates using dimensionality reduction in combination with cluster analysis to select transcripts to be measured (i.e., for example, landmark transcripts). While dimensionality reduction may be performed by a number of known methods, the embodiments described herein utilize principal component analysis. In one embodiment, the method further may comprise using a linear dimension reduction method (i.e., for example, using eigenvectors). In one embodiment, the cluster analysis creates a plurality of clusters wherein each cluster may comprise a single cluster centroid landmark transcript and a plurality of cluster non-centroid transcripts. See FIG. 1. In one preferred embodiment, clusters are achieved by using k-means clustering, wherein the k-means clustering is repeated a number of times allowing a consensus matrix to be constructed (i.e., for example, a gene-by-gene pairwise consensus matrix).
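
By way of non-limiting illustration only, the construction of a gene-by-gene pairwise consensus matrix from repeated clustering runs may be sketched as follows. The function name `consensus_matrix` is hypothetical, and the cluster labels are assumed to have been produced beforehand by repeated k-means runs on the dimensionally-reduced data; the k-means step itself is not shown.

```python
def consensus_matrix(runs):
    """Fraction of clustering runs in which each pair of transcripts co-clusters.

    runs: list of label lists, one per repetition; runs[r][i] is the cluster
    label assigned to transcript i in repetition r.
    Returns a symmetric n-by-n matrix with ones on the diagonal.
    """
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Three hypothetical repetitions over four transcripts.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
consensus = consensus_matrix(runs)
```

Because cluster labels are arbitrary from run to run, only co-membership is counted; the first two runs above therefore contribute identically.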

In one preferred embodiment, pockets of high local correlation are identified by hierarchically clustering the gene-by-gene pairwise consensus matrix. As is known to those skilled in the art, the tree from the hierarchical clustering may then be cut at multiple levels. At each level, there are numerous nodes, wherein the leaves (i.e., for example, illustrated herein as transcripts) in each node represent a tight cluster. For each tight cluster, a representative centroid ‘landmark’ transcript may be chosen by picking the transcript whose individual profile most closely correlates with the tight-cluster's mean profile. In one preferred embodiment, the cluster analysis identifies multiple (preferably more than 3 and less than 10) centroid landmark transcripts. Although it is not necessary to understand the mechanism of an invention, it is believed that the expression level of cluster centroid landmark transcripts may be used to infer the expression level of the associated cluster non-centroid transcripts.
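
By way of non-limiting illustration only, the choice of a representative centroid transcript for one tight cluster may be sketched as follows. The names `pearson` and `pick_centroid`, the transcript identifiers, and the toy values are assumptions made for this example. Pearson correlation is assumed as the measure of how closely a profile correlates with the cluster mean, and each profile is assumed to vary across samples.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pick_centroid(cluster):
    """cluster: dict mapping transcript id -> expression levels across samples.
    Returns the id whose profile correlates best with the cluster's mean profile."""
    n_samples = len(next(iter(cluster.values())))
    mean_profile = [
        sum(profile[s] for profile in cluster.values()) / len(cluster)
        for s in range(n_samples)
    ]
    return max(cluster, key=lambda t: pearson(cluster[t], mean_profile))


tight_cluster = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 9.0],
    "C": [1.0, 3.0, 2.0, 5.0],
}
landmark = pick_centroid(tight_cluster)
```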

In one embodiment, the present invention contemplates a method which may comprise creating gene-expression profiles from data consisting only of cluster centroid landmark transcript expression-level measurements. In one embodiment, medically-relevant similarities between biological samples are identified by similarities in their corresponding gene-expression profiles produced in the space of cluster centroid landmark transcripts.

In one preferred embodiment, the levels of non-measured transcripts in a new biological sample are inferred (i.e., for example, predicted) from the measurements of the landmark transcripts with reference to a dependency matrix, thereby creating a full-transcriptome gene-expression profile. In one embodiment, a dependency matrix is constructed by performing linear regression between the expression levels of each of the cluster centroid landmark genes (g) and the expression levels of all of the non-landmark transcripts (G) in a collection of transcriptome-wide expression profiles. In one preferred embodiment, a pseudo-inverse is used to build the dependency matrix (G non-landmark transcripts x g landmark transcripts). In one preferred embodiment, the collection of transcriptome-wide expression profiles used to build the dependency matrix is the same collection used to identify the cluster centroid landmark transcripts. In another embodiment, the collection of transcriptome-wide expression profiles used to build the dependency matrix is different from that used to identify the cluster centroid landmark transcripts. In one preferred embodiment, multiple dependency matrices are constructed from collections of transcriptome-wide expression profiles, each collection populated with profiles derived from the same type of normal or diseased tissues or cells. In one embodiment, the choice of dependency matrix to use for the inference is made based upon knowledge of the tissue, cell and/or pathological state of the sample. In one preferred embodiment, the expression level of each non-landmark transcript in a new biological sample is inferred by multiplying the expression levels of each of the landmark transcripts by the corresponding weights looked up from the dependency matrix, and summing those products.
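
By way of non-limiting illustration only, building a dependency matrix with a pseudo-inverse and using it for inference may be sketched as follows. The function names `build_dependency_matrix` and `infer_non_landmarks` are hypothetical, the data are synthetic, and the sketch assumes a regression without an intercept term and the availability of the NumPy library.

```python
import numpy as np


def build_dependency_matrix(landmarks, non_landmarks):
    """Least-squares weights relating landmark to non-landmark levels.

    landmarks:     array of shape (g, n_profiles)
    non_landmarks: array of shape (G, n_profiles)
    Returns W of shape (G, g) such that non_landmarks is approximated by W @ landmarks.
    """
    return non_landmarks @ np.linalg.pinv(landmarks)


def infer_non_landmarks(dependency, landmark_levels):
    """Multiply each landmark level by its weight and sum the products,
    giving one inferred level per non-landmark transcript."""
    return dependency @ landmark_levels


# Synthetic training collection: g = 3 landmarks, G = 5 non-landmarks, 50 profiles.
rng = np.random.default_rng(0)
train_landmarks = rng.normal(size=(3, 50))
true_weights = rng.normal(size=(5, 3))
train_non_landmarks = true_weights @ train_landmarks

W = build_dependency_matrix(train_landmarks, train_non_landmarks)
new_sample = np.array([1.0, 2.0, 3.0])
inferred = infer_non_landmarks(W, new_sample)
```

In this noise-free synthetic case the recovered weights match the generating weights; with measured data the pseudo-inverse instead yields the least-squares fit.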

In one preferred embodiment, the present invention contemplates a method which may comprise the creation of full-transcriptome gene-expression profiles using measurements of a plurality of landmark transcripts and inference of non-landmark transcript levels, wherein those profiles have at least 80% of the performance of gene-expression profiles produced by direct measurement of all transcripts, in a useful application of gene-expression profiling.

In one embodiment, the present invention contemplates determining the number of cluster centroid landmark transcripts suitable for the creation of transcriptome-wide gene-expression profiles by experimentation. In one embodiment, the number of cluster centroid landmark transcripts suitable for the creation of transcriptome-wide gene-expression profiles is determined by simulation.

A computational simulation presented herein (Examples I and II) demonstrates that dimensionality reduction may be applied to the identification of a plurality of cluster centroid landmark transcripts, and that surprisingly few landmark-transcript measurements are sufficient to faithfully recreate full-transcriptome profiles. It is shown that the expression levels of only 1,000 cluster centroid landmark transcripts (i.e., for example, <5% of transcripts in the transcriptome) may be used to recreate full-transcriptome expression profiles that perform as well as profiles in which all transcripts were measured directly in 80% of tests for profile similarity examined. Further, these data demonstrate that 500 centroid landmark transcripts (i.e., for example, <2.5% of transcripts in the transcriptome) recover approximately 50% of such similarities (FIG. 2).

In one preferred embodiment, the present invention contemplates a method which may comprise approximately 1,000 cluster centroid landmark transcripts from which the expression levels of the remainder of the transcriptome may be inferred.

In one embodiment, the present invention contemplates measuring the expression levels of a set of cluster centroid landmark transcripts in a biological sample which may comprise a plurality of transcripts, and using a corresponding dependency matrix to predict the expression levels of the transcripts not measured, thereby creating a full-transcriptome expression profile. In one preferred embodiment, the expression levels of the set of cluster centroid landmark transcripts are measured simultaneously. In another preferred embodiment, the number of cluster centroid landmark transcripts measured is approximately 1,000. In another preferred embodiment, the expression levels of the set of cluster centroid landmark transcripts are measured using a moderate-multiplex assay platform. As is well known to those skilled in the art, there are many methods potentially capable of determining the expression level of a moderate number (i.e. approximately 10 to approximately 1,000) of transcripts simultaneously. These include, but are not limited to, multiplexed nuclease-protection assay, multiplexed RT-PCR, DNA microarrays, nucleic-acid sequencing, and various commercial solutions offered by companies including, but not limited to, Panomics, High Throughput Genomics, NanoString, Fluidigm, Nimblegen, Affymetrix, Agilent, and Illumina.

In one preferred embodiment, the present invention contemplates a method for generating a full-transcriptome gene-expression profile by simultaneously measuring the expression levels of a set of cluster centroid landmark transcripts in a biological sample which may comprise a plurality of transcripts, and using a corresponding dependency matrix to predict the expression levels of the transcripts not measured, where the said simultaneous measurements are made using nucleic-acid sequencing.

In one preferred embodiment, the present invention contemplates a method for generating a full-transcriptome gene-expression profile by simultaneously measuring the expression levels of a set of cluster centroid landmark transcripts in a biological sample which may comprise a plurality of transcripts, and using a corresponding dependency matrix to predict the expression levels of the transcripts not measured, where the said simultaneous measurements are made using multiplex ligation-mediated amplification with Luminex FlexMAP optically-addressed and barcoded microspheres and flow-cytometric detection (LMF); Peck et al., “A method for high-throughput gene expression signature analysis” Genome Biology 7:R61 (2006). See FIG. 3. In this technique, transcripts are captured on immobilized poly-dT and reverse transcribed. Two oligonucleotide probes are designed for each transcript of interest. Upstream probes contain 20 nt complementary to a universal primer (T7) site, one of a set of unique 24 nt barcode sequences, and a 20 nt sequence complementary to the corresponding first-strand cDNA. Downstream probes are 5′-phosphorylated and contain 20 nt contiguous with the gene-specific fragment of the corresponding upstream probe and a 20 nt universal-primer (T3) site. Probes are annealed to target cDNAs, free probes removed, and juxtaposed probes joined by the action of ligase enzyme to yield 104 nt amplification templates. PCR is performed with T3 and 5′-biotinylated T7 primers. Biotinylated barcoded amplicons are hybridized against a pool of optically-addressed microspheres each expressing capture probes complementary to a barcode, and incubated with streptavidin-phycoerythrin to label biotin moieties fluorescently. Captured labeled amplicons are quantified and beads decoded by flow cytometry in Luminex detectors. The above reported LMF method was limited to measuring 100 transcripts simultaneously due to the availability of only 100 optical addresses.
In one embodiment, the present invention contemplates a method for generating gene-expression profiles using simultaneous measurement of the levels of cluster centroid landmark transcripts that is compatible with an expanded number (approximately 500, and preferably 1,000) of barcode sequences, and optically-addressed microspheres and a corresponding flow-cytometric detection device. In one embodiment, the present invention contemplates a method which may comprise two assays per biological sample, each capable of measuring the expression levels of approximately 500 cluster centroid transcripts. In one embodiment, the present invention contemplates a method where the expression levels of approximately 1,000 cluster centroid landmark transcripts are measured in one assay per biological sample using less than 1,000 populations of optically-addressed microspheres by arranging for microspheres to express more than one type of capture probe complementary to a barcode. In one embodiment, the present invention contemplates a method which may comprise one assay per sample, each capable of measuring the expression levels of 1,000 cluster centroid landmark transcripts.
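
By way of non-limiting illustration only, the arithmetic behind the LMF probe design and assay counts described above may be checked as follows. The segment lengths are those recited for the upstream and downstream probes; the names `template_length` and `assays_needed` are hypothetical, and the assay count assumes one transcript per optical address.

```python
import math

# Segment lengths in nucleotides, as recited for each probe.
UPSTREAM_PROBE = {"T7 universal-primer site": 20, "barcode": 24, "gene-specific": 20}
DOWNSTREAM_PROBE = {"gene-specific": 20, "T3 universal-primer site": 20}


def template_length(upstream, downstream):
    """Length of the ligated amplification template formed by joining both probes."""
    return sum(upstream.values()) + sum(downstream.values())


def assays_needed(n_transcripts, addresses_per_assay):
    """Assays per sample when each optical address reports a single transcript."""
    return math.ceil(n_transcripts / addresses_per_assay)


ligated = template_length(UPSTREAM_PROBE, DOWNSTREAM_PROBE)
two_assay_design = assays_needed(1000, 500)
original_lmf = assays_needed(1000, 100)
```

Under that one-transcript-per-address assumption, 1,000 landmark transcripts correspond to two assays at approximately 500 addresses each, whereas the originally reported 100 addresses would correspond to ten.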

As is well known to those skilled in the art, an estimate of the expression level of a transcript made with one method (e.g. RT-PCR) does not always agree with the estimate of the expression level of that same transcript in the same biological sample made with another method (e.g. DNA microarray). In one embodiment, the present invention contemplates a method for selecting the set of cluster centroid landmark transcripts to be measured by a given moderate-multiplex assay platform for the purposes of predicting the expression levels of transcripts not measured, and thereby to create a full-transcriptome gene-expression profile, from the set of all possible cluster centroid landmark transcripts by experimentation. In one preferred embodiment, the set of cluster centroid landmark transcripts to be measured by a given moderate-multiplex assay platform is selected by empirically confirming concordance between measurements of expression levels of cluster centroid landmark transcripts made by that platform and those made using the transcriptome-wide gene-expression profiling technology used to generate the collection of gene-expression profiles from which the universe of cluster centroid landmark transcripts was originally selected. In one especially preferred embodiment, the expression levels of all possible cluster centroid landmark transcripts (preferably numbering approximately 1,300) in a collection of biological samples (preferably numbering approximately 384) are estimated by both LMF and Affymetrix oligonucleotide microarrays, where Affymetrix oligonucleotide microarrays were used to produce the transcriptome-wide gene-expression profiles from which the universe of possible cluster centroid landmark transcripts was selected, resulting in the identification of a set of cluster centroid landmark transcripts (preferably numbering approximately 1,100) whose expression level estimated by LMF is consistently concordant with the expression levels estimated by Affymetrix oligonucleotide microarrays.
Data presented herein (Example III) show unanticipated discordances between expression-level measurements made using LMF and Affymetrix oligonucleotide microarrays.
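
By way of non-limiting illustration only, empirical confirmation of cross-platform concordance may be sketched as follows. The names `pearson` and `concordant_landmarks`, the transcript identifiers, and the threshold of 0.8 are assumptions made for this example; no particular concordance statistic or cutoff is specified herein.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def concordant_landmarks(platform_a, platform_b, min_r=0.8):
    """Keep candidate landmarks whose levels across the same samples agree
    between two platforms (hypothetical cutoff: correlation >= min_r).

    platform_a, platform_b: dict mapping transcript id -> levels per sample,
    with samples in the same order on both platforms.
    """
    return sorted(
        t for t in platform_a
        if t in platform_b and pearson(platform_a[t], platform_b[t]) >= min_r
    )


lmf = {"T1": [1.0, 2.0, 3.0, 4.0], "T2": [1.0, 2.0, 3.0, 4.0]}
microarray = {"T1": [2.0, 4.0, 6.0, 8.0], "T2": [4.0, 1.0, 3.0, 2.0]}
selected = concordant_landmarks(lmf, microarray)
```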

In one embodiment, the present invention contemplates a method for selecting the final set of cluster centroid landmark transcripts to be measured by a given moderate-multiplex assay platform for the purposes of predicting the expression levels of transcripts not measured, and thereby to create a full-transcriptome gene-expression profile, from the set of all possible cluster centroid landmark transcripts by experimentation. In one preferred embodiment, the set of cluster centroid landmark transcripts to be measured by a given moderate-multiplex assay platform is selected by empirically confirming that measurements of their expression levels made by that platform may be used to predict the expression level of non-landmark transcripts in their cluster measured using the transcriptome-wide gene-expression profiling technology used to generate the collection of gene-expression profiles from which the universe of cluster centroid landmark transcripts was selected.

In one especially preferred embodiment, the expression levels of all possible cluster centroid landmark transcripts (preferably numbering approximately 1,300) in a collection of biological samples (preferably numbering approximately 384) are measured by LMF, and the expression levels of all non-landmark transcripts are measured in that same collection of biological samples by Affymetrix oligonucleotide microarrays, where Affymetrix oligonucleotide microarrays were used to produce the transcriptome-wide gene-expression profiles from which the universe of possible cluster centroid landmark transcripts was selected, resulting in the identification of a final set of cluster centroid landmark transcripts (preferably numbering approximately 1,000) whose expression levels estimated by LMF may consistently be used to predict the expression level of transcripts in their clusters as measured by Affymetrix oligonucleotide microarrays. Data presented herein (Example III) show unanticipated failures of measurements of the expression levels of certain cluster centroid landmark transcripts made using LMF to be useful for predicting the expression levels of transcripts in their cluster measured using Affymetrix oligonucleotide microarrays.

In one embodiment, the present invention contemplates creating a dependency matrix specific to the final set of cluster centroid landmark transcripts selected for a given moderate-multiplex assay platform.

Data presented herein (Examples IV, V, VI, VII) demonstrate the generation of useful transcriptome-wide gene-expression profiles from the measurement of the expression levels of a set of cluster centroid landmark transcripts selected for use with a specific moderate-multiplex assay platform.

In one embodiment, the present invention contemplates a method which may comprise normalization (i.e., for example, scaling) of gene-expression data to correct for day-to-day or detector-to-detector variability in signal intensities. Although it is not necessary to understand the mechanism of an invention, it is believed that in transcriptome-wide gene-expression profiles (i.e., for example, high-density microarray data with approximately 20,000 dimensions) convention assumes that the vast majority of the transcripts do not change in a given state. Such an assumption allows a summation of the expression levels for all transcripts to be taken as a measure of overall signal intensity. Those using conventional systems then normalize the expression level of each transcript against that overall signal-intensity value.

However, when using gene-expression profiles of lower dimensionality (i.e., for example, 1,000 transcripts) it is not reasonable to suppose that only a small fraction of those transcripts change, especially in the special case of cluster centroid landmark transcripts where the transcripts were selected, in part, because each exhibited different levels across a diversity of samples. Consequently, normalization relative to a sum of the levels of all transcripts is not suitable.

In one embodiment, the present invention contemplates normalizing gene-expression profiles relative to a set of transcripts whose levels do not change across a large collection of diverse samples (i.e., for example, invariant transcripts). Such a process is loosely analogous to the use of a so-called housekeeping gene (i.e., for example, GAPDH) as a reference in a qRT-PCR. Although it is not necessary to understand the mechanism of an invention, it is believed that the normalization described herein is superior to other known normalization techniques because the invariant transcripts are empirically determined to have invariant expression across a broad diversity of samples.
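
By way of non-limiting illustration only, scaling landmark measurements relative to invariant transcripts measured in the same sample may be sketched as follows. The function name `scale_to_invariants` and the transcript identifiers are hypothetical, and the use of a single reference value (the mean of the invariant levels) is a simplifying assumption; a reference spanning several expression levels could equally be used.

```python
import statistics


def scale_to_invariants(sample, invariant_ids):
    """Divide each non-invariant level by the mean invariant level of the same sample.

    sample: dict mapping transcript id -> measured signal intensity.
    invariant_ids: set of ids treated as invariant transcripts.
    """
    reference = statistics.fmean([sample[t] for t in invariant_ids])
    return {
        t: level / reference
        for t, level in sample.items()
        if t not in invariant_ids
    }


invariants = {"INV1", "INV2"}
day_one = {"INV1": 100.0, "INV2": 300.0, "LM1": 400.0, "LM2": 50.0}
# Same sample read on a detector giving twice the overall signal.
day_two = {t: 2.0 * level for t, level in day_one.items()}

scaled_one = scale_to_invariants(day_one, invariants)
scaled_two = scale_to_invariants(day_two, invariants)
```

A uniform change in signal intensity cancels out, so both readings give the same scaled landmark levels.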

In one embodiment, the set of transcripts (numbering between 10 and 50, preferably 25) having the lowest coefficients-of-variation (CV) of all transcripts at each of a number (preferably approximately 14) of expression levels chosen to span the range of expression levels observed from a collection of transcriptome-wide gene-expression profiles (numbering at least 1,000 and preferably approximately 7,000), are identified as invariant transcripts. In one preferred embodiment, the collection of transcriptome-wide gene-expression profiles used to select invariant transcripts is build02 of the Connectivity Map dataset (broadinstitute.org/cmap). In one preferred embodiment, a final set of invariant transcripts (numbering between 14 and 98, preferably 80) to be used to normalize measurements of expression levels of cluster centroid landmark transcripts made using a given moderate-multiplex assay platform is selected from the set of all invariant transcripts by empirically confirming concordance between measurements of their expression levels made by that platform and those made using the transcriptome-wide gene-expression profiling technology used to generate the collection of gene-expression profiles from which the invariant transcripts were originally identified, and that their expression levels are indeed substantially invariant, in a collection of biological samples (numbering preferably approximately 384).
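
By way of non-limiting illustration only, identifying the lowest-CV transcripts at each of several expression levels may be sketched as follows. The names `coefficient_of_variation` and `select_invariant`, the toy values, and the use of contiguous bins on mean expression to represent "expression levels" are assumptions made for this example; all mean levels are assumed positive.

```python
import statistics


def coefficient_of_variation(levels):
    """Population standard deviation divided by the mean (mean assumed positive)."""
    return statistics.pstdev(levels) / statistics.fmean(levels)


def select_invariant(profiles, level_edges, per_level):
    """Pick the lowest-CV transcripts within each expression-level bin.

    profiles: dict mapping transcript id -> levels across the profile collection.
    level_edges: ascending bin boundaries on mean expression, e.g. [0, 50, 200].
    per_level: number of transcripts to keep per bin.
    """
    chosen = []
    for low, high in zip(level_edges, level_edges[1:]):
        in_bin = [
            t for t, v in profiles.items()
            if low <= statistics.fmean(v) < high
        ]
        in_bin.sort(key=lambda t: coefficient_of_variation(profiles[t]))
        chosen.extend(in_bin[:per_level])
    return chosen


collection = {
    "low_stable": [10.0, 10.0, 11.0, 9.0],
    "low_noisy": [2.0, 18.0, 5.0, 15.0],
    "high_stable": [100.0, 101.0, 99.0, 100.0],
    "high_noisy": [50.0, 150.0, 80.0, 120.0],
}
invariant = select_invariant(collection, level_edges=[0, 50, 200], per_level=1)
```

With approximately 14 bins and 25 transcripts per bin, the same procedure would yield on the order of 350 candidate invariant transcripts.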

Data presented herein (Examples IV, V, VI, VII) demonstrate the generation of useful transcriptome-wide gene-expression profiles from the measurement of the expression levels of a set of cluster centroid landmark transcripts measured on a selected moderate-multiplex assay platform scaled relative to the expression levels of a set of invariant transcripts measured together on the same platform.

It has been reported that gene regulation may be studied on a genomic level using dimensionality reduction in combination with clustering techniques. For example, gene co-regulation may be inferred from gene co-expression dynamics (i.e., for example, gene-gene interactions) using a dimensionally reduced biological dataset. Capobianco E., “Model Validation For Gene Selection And Regulation Maps” Funct Integr Genomics 8(2):87-99 (2008). This approach suggests three feature extraction methods that may detect genes with the greatest differential expression by clustering analysis (i.e., for example, k-means) in combination with principal and/or independent component analysis. In transcriptomics, for instance, clusters may be formed by genes having similar expression patterns. Dimensionality reduction, however, is used primarily to eliminate “noise” from useful biological information. A correlation matrix may be computed whose decomposition applies according to an eigensystem including eigenvalues (i.e., for example, the energies of the modes) and eigenvectors (i.e., for example, y, determined by maximizing the energy in each mode). Selecting representative differentially expressed genes may be performed by ‘regularization via shrinkage’ that isolates cluster outliers to pick the genes having the greatest differential levels of expression.

Other dimensionality reduction methods have been used in proteomic biomarker studies. For example, mass-spectra based proteomic profiles have been used as disease biomarkers that generate datasets having extremely high dimensionality (i.e. number of features or variables) of proteomic data with a small sample size. Among these methods, one report suggests using a feature selection method described as centroid shrinkage, wherein data sets may be evaluated using causal inference techniques. Training samples are used to identify class centroids, wherein a test sample is assigned to a class belonging to the closest centroid. Hilario et al., “Approaches To Dimensionality Reduction In Proteomic Biomarker Studies” Brief Bioinform 9(2):102-118 (2008). Centroid shrinkage analysis has been previously used in gene expression analysis to diagnose cancers.

One dimensionality reduction report identifies a subset of features from within a large set of features. Such a selection process is performed by training a support vector machine to rank the features according to classifier weights. For example, a selection may be made for the smallest number of genes that are capable of accurately distinguishing between medical conditions (i.e., for example, cancer versus non-cancer). Principal component analysis is capable of clustering gene expression data, wherein specific genes are selected within each cluster as highly correlated with the expression of cancer. Golub's eigenspace vector method to predict gene function with cancer is directly compared and contrasted as an inferior method. Barnhill et al., “Feature Selection Method Using Support Vector Machine Classifier” United States Patent 7,542,959 (co135-49).

Linear transformations (i.e., for example, principal component analysis) may also be capable of identifying low-dimensional embeddings of multivariate data, in a way that optimally preserves the structure of the data. In particular, the performance of dimensionality reduction may be enhanced. Furthermore, the resulting dimensionality reduction may maintain data coordinates and pairwise relationships between the data elements. Subsequent clustering of decomposition information may be integrated in the linear transformation that clearly shows separation between the clusters, as well as their internal structure. Koren et al., “Robust Linear Dimensionality Reduction” IEEE Trans Vis Comput Graph. 10(4):459-470 (2004).

Further, the invention encompasses methods and systems for organizing complex and disparate data. Principal component analysis may be used to evaluate phenotypic, gene expression, and metabolite data collected from Arabidopsis plants treated with eighteen different herbicides. Gene expression and transcription analysis was limited to evaluating gene expression in the context of cell function. Winfield et al., “Methods And Systems For Analyzing Complex Biological Systems” U.S. Pat. No. 6,873,914.

Functional genomics and proteomics may be studied involving the simultaneous analysis of hundreds or thousands of expressed genes or proteins. From these large datasets, dimensionality reduction strategies have been used to identify clinically exploitable biomarkers from enormous experimental datasets. The field of transcriptomics could benefit from using dimensionality reduction methods in high-throughput methods using microarrays. Finn W G., “Diagnostic Pathology And Laboratory Medicine In The Age Of ‘omics’” J Mol Diagn. 9(4):431-436 (2007).

Multifactor dimensionality reduction (MDR) may also be useful for detecting and modeling epistasis, including the identification of single nucleotide polymorphisms (SNPs). MDR pools genotypes into ‘high-risk’ and ‘low-risk’ or ‘response’ and ‘non-response’ groups in order to reduce multidimensional data into only one dimension. MDR has detected gene-gene interactions in diseases such as sporadic breast cancer, multiple sclerosis and essential hypertension. MDR may be useful in evaluating most common diseases that are caused by the non-linear interaction of numerous genetic and environmental variables. Motsinger et al., “Multifactor Dimensionality Reduction: An Analysis Strategy For Modeling And Detecting Gene-Gene Interactions In Human Genetics And Pharmacogenomics Studies” Hum Genomics 2(5):318-328 (2006).

Another report attempted to use 6,100 transcripts to represent the entire transcriptome in an effort to avoid measuring for genes that were not expected to be expressed. Hoshida et al., “Gene Expression in Fixed Tissues and Outcome in Hepatocellular Carcinoma” New Engl J Med 259:19 (2008).

mRNA expression may be measured by any suitable method, including but not limited to, those disclosed below.

In some embodiments, RNA is detected by Northern blot analysis. Northern blot analysis involves the separation of RNA and hybridization of a complementary labeled probe.

In other embodiments, RNA expression is detected by enzymatic cleavage of specific structures (INVADER assay, Third Wave Technologies; See e.g., U.S. Pat. Nos. 5,846,717, 6,090,543; 6,001,567; 5,985,557; and 5,994,069; each of which is herein incorporated by reference). The INVADER assay detects specific nucleic acid (e.g., RNA) sequences by using structure-specific enzymes to cleave a complex formed by the hybridization of overlapping oligonucleotide probes.

In still further embodiments, RNA (or corresponding cDNA) is detected by hybridization to an oligonucleotide probe. A variety of hybridization assays using a variety of technologies for hybridization and detection are available. For example, in some embodiments, TaqMan assay (PE Biosystems, Foster City, Calif.; See e.g., U.S. Pat. Nos. 5,962,233 and 5,538,848, each of which is herein incorporated by reference) is utilized. The assay is performed during a PCR reaction. The TaqMan assay exploits the 5′-3′ exonuclease activity of the AMPLITAQ GOLD DNA polymerase. A probe consisting of an oligonucleotide with a 5′-reporter dye (e.g., a fluorescent dye) and a 3′-quencher dye is included in the PCR reaction. During PCR, if the probe is bound to its target, the 5′-3′ nucleolytic activity of the AMPLITAQ GOLD polymerase cleaves the probe between the reporter and the quencher dye. The separation of the reporter dye from the quencher dye results in an increase of fluorescence. The signal accumulates with each cycle of PCR and may be monitored with a fluorimeter.

In yet other embodiments, reverse-transcriptase PCR (RT-PCR) is used to detect the expression of RNA. In RT-PCR, RNA is enzymatically converted to complementary DNA or “cDNA” using a reverse transcriptase enzyme. The cDNA is then used as a template for a PCR reaction. PCR products may be detected by any suitable method, including but not limited to, gel electrophoresis and staining with a DNA specific stain or hybridization to a labeled probe. In some embodiments, the quantitative reverse transcriptase PCR with standardized mixtures of competitive templates method described in U.S. Pat. Nos. 5,639,606, 5,643,765, and 5,876,978 (each of which is herein incorporated by reference) is utilized.

The method most commonly used as the basis for nucleic acid sequencing, or for identifying a target base, is the enzymatic chain-termination method of Sanger. Traditionally, such methods relied on gel electrophoresis to resolve, according to their size, nucleic acid fragments produced from a larger nucleic acid segment. However, in recent years various sequencing technologies have evolved which rely on a range of different detection strategies, such as mass spectrometry and array technologies.

One class of sequencing methods assuming importance in the art are those which rely upon the detection of PPi release as the detection strategy. It has been found that such methods lend themselves admirably to large scale genomic projects or clinical sequencing or screening, where relatively cost-effective units with high throughput are needed.

Methods of sequencing based on the concept of detecting inorganic pyrophosphate (PPi) which is released during a polymerase reaction have been described in the literature (for example, WO 93/23564, WO 89/09283, WO 98/13523 and WO 98/28440). As each nucleotide is added to a growing nucleic acid strand during a polymerase reaction, a pyrophosphate molecule is released. It has been found that pyrophosphate released under these conditions may readily be detected, for example enzymically, e.g., by the generation of light in the luciferase-luciferin reaction. Such methods enable a base to be identified in a target position and DNA to be sequenced simply and rapidly whilst avoiding the need for electrophoresis and the use of labels.

At its most basic, a PPi-based sequencing reaction involves simply carrying out a primer-directed polymerase extension reaction, and detecting whether or not that nucleotide has been incorporated by detecting whether or not PPi has been released. Conveniently, this detection of PPi-release may be achieved enzymatically, and most conveniently by means of a luciferase-based light detection reaction termed ELIDA (see further below).

It has been found that dATP added as a nucleotide for incorporation interferes with the luciferase reaction used for PPi detection. Accordingly, a major improvement to the basic PPi-based sequencing method has been to use, in place of dATP, a dATP analogue (specifically dATPαS) which is incapable of acting as a substrate for luciferase, but which is nonetheless capable of being incorporated into a nucleotide chain by a polymerase enzyme (WO 98/13523).

Further improvements to the basic PPi-based sequencing technique include the use of a nucleotide degrading enzyme such as apyrase during the polymerase step, so that unincorporated nucleotides are degraded, as described in WO 98/28440, and the use of a single-stranded nucleic acid binding protein in the reaction mixture after annealing of the primers to the template, which has been found to have a beneficial effect in reducing the number of false signals, as described in WO 00/43540.

In other embodiments, gene expression may be detected by measuring the expression of a protein or polypeptide. Protein expression may be detected by any suitable method. In some embodiments, proteins are detected by immunohistochemistry. In other embodiments, proteins are detected by their binding to an antibody raised against the protein. The generation of antibodies is described below.

Antibody binding may be detected by many different techniques including, but not limited to, radioimmunoassay, ELISA (enzyme-linked immunosorbent assay), “sandwich” immunoassays, immunoradiometric assays, gel diffusion precipitation reactions, immunodiffusion assays, in situ immunoassays (e.g., using colloidal gold, enzyme or radioisotope labels), Western blots, precipitation reactions, agglutination assays (e.g., gel agglutination assays, hemagglutination assays, etc.), complement fixation assays, immunofluorescence assays, protein A assays, and immunoelectrophoresis assays.

In one embodiment, antibody binding is detected by detecting a label on the primary antibody. In another embodiment, the primary antibody is detected by detecting binding of a secondary antibody or reagent to the primary antibody. In a further embodiment, the secondary antibody is labeled.

In some embodiments, an automated detection assay is utilized. Methods for the automation of immunoassays include those described in U.S. Pat. Nos. 5,885,530, 4,981,785, 6,159,750, and 5,358,691, each of which is herein incorporated by reference. In some embodiments, the analysis and presentation of results is also automated. For example, in some embodiments, software that generates a prognosis based on the presence or absence of a series of proteins corresponding to cancer markers is utilized.

In other embodiments, the immunoassay described in U.S. Pat. Nos. 5,599,677 and 5,672,480 (each of which is herein incorporated by reference) is utilized.

In some embodiments, a computer-based analysis program is used to translate the raw data generated by the detection assay (e.g., the presence, absence, or amount of a given transcript or transcripts) into data of predictive value for a clinician or researcher. The clinician or researcher may access the predictive data using any suitable means. Thus, in some preferred embodiments, the present invention provides the further benefit that the clinician or researcher, who is not likely to be trained in genetics or genomics, need not understand the raw data. The data is presented directly to the clinician or researcher in its most useful form. The clinician or researcher is then able to immediately utilize the information in order to optimize the care of the subject or advance the discovery objectives.

The present invention contemplates any method capable of receiving, processing, and transmitting the information to and from laboratories conducting the assays, wherein the information is provided to medical personnel and/or subjects and/or researchers. For example, in some embodiments of the present invention, a sample (e.g., a biopsy or a serum or urine sample or perturbed cells or tissue) is obtained from a subject or experimental procedure and submitted to a profiling service (e.g., clinical laboratory at a medical facility, genomic profiling business, etc.), located in any part of the world (e.g., in a country different than the country where the subject resides, the experiment is performed, or the information is ultimately used) to generate raw data. Where the sample may comprise a tissue or other biological sample, the subject may visit a medical center to have the sample obtained and sent to the profiling center, or subjects may collect the sample themselves (e.g., a urine sample) and directly send it to a profiling center. Where the sample may comprise previously determined biological information, the information may be directly sent to the profiling service by the subject (e.g., an information card containing the information may be scanned by a computer and the data transmitted to a computer of the profiling center using an electronic communication system). Once received by the profiling service, the sample is processed and a profile is produced (i.e., expression data) specific for the diagnostic or prognostic information desired for the subject, or the discovery objective of the researcher.

The profile data is then prepared in a format suitable for interpretation by a treating clinician or researcher. For example, rather than providing raw expression data, the prepared format may represent a diagnosis or risk assessment for the subject, along with recommendations for particular treatment options, or mechanism-of-action, protein-target prediction, or potential therapeutic use for an experimental perturbagen. The data may be displayed to the clinician or researcher by any suitable method. For example, in some embodiments, the profiling service generates a report that may be printed for the clinician or researcher (e.g., at the point of care or laboratory) or displayed to the clinician or researcher on a computer monitor.

In some embodiments, the information is first analyzed at the point of care or laboratory or at a regional facility. The raw data is then sent to a central processing facility for further analysis and/or to convert the raw data to information useful for a clinician, patient or researcher. The central processing facility provides the advantage of privacy (all data is stored in a central facility with uniform security protocols), speed, and uniformity of data analysis. The central processing facility may then control the fate of the data following treatment of the subject or completion of the experiment. For example, using an electronic communication system, the central facility may provide data to the clinician, the subject, or researchers.

In some embodiments, the subject is able to directly access the data using the electronic communication system. The subject may choose further intervention or counseling based on the results. In some embodiments, the data is used for research use. For example, the data may be used to further optimize the inclusion or elimination of markers as useful indicators of a particular condition or stage of disease.

One method for differentiating between cell types within a heterogeneous cell mixture has been reported that generates a multimodal distribution pattern following simultaneous flow cytometric data collection. Specifically, multimodal/multispectral images of a population of cells were simultaneously collected, wherein photometric and/or morphometric features identifiable in the images were used to separate the population of cells into subpopulations. A multi-spectral flow cytometer was configured to detect light signals generated by a variety of labels such as DAPI, FITC, dark field, PE, bright field, and Deep Red. These respective labels were conjugated to specific antibodies that had differential specific binding for normal cells versus diseased cells. Consequently, an abnormal ratio of detected cell patterns provides a basis for disease diagnosis. As such, this method was limited to the ability to detect and label antigenic sites on biological cell surfaces that identified the cell's physiological state. Ortyn et al., “Blood And Cell Analysis Using An Imaging Flow Cytometer” United States Patent Application 2009/0190822 (herein incorporated by reference).

A qualitative and quantitative assessment of a plurality of analytes from a biological sample using microwell technology has been developed wherein the biological analytes are attached to a lithographic grid via known biological recognition elements. Identification of the analytes is accomplished by attaching luminescent labels having different emission wavelengths to either the analyte or the recognition element. Consequently, the assay may differentiate between analytes by using two or more labels having the same excitation wavelength, but differing in emission wavelength. Once the analytes are contacted with the lithographic grid, the analyte/recognition element complexes are detected using optically generated luminescent detection technology. Cross-reactivity between analytes could be differentiated by providing recognition elements having differing affinities for the respective analytes. Pawlak et al., “Kit and method for determining a plurality of analytes” U.S. Pat. No. 7,396,675 (herein incorporated by reference).

A method specific for detecting circulating antibodies has been reported that uses microspheres conjugated to labeled antigens for the antibodies. The labeled antigens are usually other antibodies having specific affinity for species-specific Fc portions of a circulating antibody. The labels are described as generally fluorescent labels that are detected using a conventional flow cytometer. A multiplex calibration technique is described that uses several subsets of microspheres or beads, wherein the surface of each microsphere subset has a different concentration of the same antigen. This calibration procedure thereby generates “a standard curve” such that the concentration of a circulating antibody may be estimated. Connelly et al., “Method and composition for homogeneous multiplexed microparticle-based assay” U.S. Pat. No. 7,674,632 (herein incorporated by reference).

Solution-based methods are generally based upon the use of detectable target-specific bead sets which comprise a capture probe coupled to a detectable bead, where the capture probe binds to an individual labeled target nucleic acid. Each population of bead sets is a collection of individual bead sets, each of which has a unique detectable label which allows it to be distinguished from the other bead sets within the population of bead sets (i.e., for example, ranging from 5-500 bead sets depending upon assay sensitivity parameters). Any labels or signals may be used to detect the bead sets as long as they provide unique detectable signals for each bead set within the population of bead sets to be processed in a single reaction. Detectable labels include but are not limited to fluorescent labels and enzymatic labels, as well as magnetic or paramagnetic particles (see, e.g., Dynabeads® (Dynal, Oslo, Norway)). The detectable label may be on the surface of the bead or within the interior of the bead.

The composition of the beads may vary. Suitable materials include, but are not limited to, any materials used as affinity matrices or supports for chemical and biological molecule syntheses and analyses, including but not limited to: polystyrene, polycarbonate, polypropylene, nylon, glass, dextran, chitin, sand, pumice, agarose, polysaccharides, dendrimers, buckyballs, polyacrylamide, silicon, rubber, and other materials used as supports for solid phase syntheses, affinity separations and purifications, hybridization reactions, immunoassays and other such applications. Typically the beads have at least one dimension in the 5-10 mm range or smaller. The beads may have any shape and dimensions, but typically have at least one dimension that is 100 mm or less, for example, 50 mm or less, 10 mm or less, 1 mm or less, 100 µm or less, 50 µm or less, and typically have a size that is 10 µm or less such as 1 µm or less, 100 nm or less, and 10 nm or less. In one embodiment, the beads have at least one dimension between 2-20 µm. Such beads are often, but not necessarily, spherical (e.g., they may be elliptical). Such reference, however, does not constrain the geometry of the matrix, which may be any shape, including random shapes, needles, fibers, and elongated forms. Roughly spherical beads, particularly microspheres that may be used in the liquid phase, also are contemplated. The beads may include additional components, as long as the additional components do not interfere with the methods and analyses herein.

Commercially available beads which may be used in the methods of the present invention include, but are not limited to, bead-based technologies available from Luminex, Illumina, and/or Lynx. In one embodiment, microbeads may be labeled with different spectral properties and/or fluorescent (or colorimetric) intensities. For example, polystyrene microspheres are provided by Luminex Corp, Austin, Tex. that are internally dyed with two spectrally distinct fluorochromes. Using precise ratios of these fluorochromes, a large number of different fluorescent bead sets may be produced (i.e., for example, 5-100 bead sets). Each set of the beads may be distinguished by its spectral address, a combination of which allows for measurement of a large number of analytes in a single reaction vessel. Alternatively, a detectable target molecule may be labeled with a third fluorochrome. Because each of the different bead sets is uniquely labeled with a distinguishable spectral address, the resulting hybridized bead-target complexes will be distinguishable for each different target nucleic acid, which may be detected by passing the hybridized bead-target complexes through a rapidly flowing fluid stream. In the stream, the beads are interrogated individually as they pass two separate lasers. High speed digital signal processing classifies each of the beads based on its spectral address and quantifies the reaction on the surface. Thousands of beads may be interrogated per second, resulting in high speed, high throughput and accurate detection of multiple different target nucleic acids in a single reaction. In addition to a detectable label, the bead sets may also contain a capture probe which may bind to an individual target analyte. For example, a capture probe may comprise a nucleic acid, a protein, a peptide, a biological receptor, an enzyme, a hormone, an antibody, a polyclonal antibody, a monoclonal antibody, and/or an Fab fragment.
If the capture probe is a short unique DNA sequence, it may comprise uniform hybridization characteristics with a target nucleic acid analyte. The capture probe may be coupled to the beads using any suitable method which generates a stable linkage between probe and the bead, and permits handling of the bead without compromising the linkage using further methods of the invention. Nucleic acid coupling reactions include, but are not limited to, the use of capture probes modified with a 5′ amine for coupling to carboxylated microsphere or bead.

Most bead-based analyte detection systems are based upon Luminex colored beads, and/or the Luminex flow cytometric measurement system. The flow cytometric measurement system provides a summary report of median fluorescent intensity (MFI) values for each measured analyte as well as bead-level output data for each sample. The bead-level output data is usually stored in a standard flow cytometry data format and includes the set membership and fluorescent intensity of each individual bead that is detected. Although it is not necessary to understand the mechanism of an invention, it is believed that this data collection and storage capability suggests that the capacity of the commercially available Luminex system may be expanded beyond its commonly accepted limitation of 500 bead-sets per well.

The Luminex xMAP® technology is a commercially available bead-based system that has a limitation for simultaneous measurements of up to 500 analytes per sample. Measurement instruments used to support Luminex technology are basically flow cytometers capable of detecting and/or identifying 500 color bead set variations. Usually, each specific color bead variation provides a unique identification for an individual analyte. In particular, the system assigns each bead detected in a sample to a set based on its color. The system then summarizes the measurement value for each set by reporting the median fluorescent intensity (MFI) of all beads belonging to that set.
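
The set-wise summarization described above may be illustrated with a minimal Python sketch. The function name and the representation of a bead as a (set identifier, reporter intensity) pair are assumptions made for illustration only and do not describe the instrument's actual software.

```python
from collections import defaultdict
from statistics import median


def median_fluorescent_intensity(beads):
    """Group beads by set and report the median reporter intensity per set.

    `beads` is an iterable of (set_id, reporter_intensity) pairs; this
    representation is a hypothetical stand-in for bead-level output.
    """
    by_set = defaultdict(list)
    for set_id, intensity in beads:
        by_set[set_id].append(intensity)
    return {set_id: median(values) for set_id, values in by_set.items()}


# Illustrative values only: three beads in set 1, two beads in set 2.
example = [(1, 100.0), (1, 120.0), (1, 5000.0), (2, 40.0), (2, 60.0)]
mfi = median_fluorescent_intensity(example)
```

In this illustration the single very bright bead in set 1 does not shift the reported value, which is one reason a median is a common default summary.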

Recent advances in biotechnology, and in particular genomics, have exceeded the usefulness of data sets restricted to a 500 analyte assay. For example, in gene expression profiling, one might be interested in measuring the expression of more than 500 genes. One approach to overcome this limitation is to use two or more collections of the 500 bead sets, wherein each collection interrogates a different set of 500 genes. This approach requires measurement of the same sample in two separate wells to provide a complete assay. The problem with this approach is that it requires twice the amount of sample and takes twice the amount of time for detection. Duplicate sampling techniques are also prone to failures since failure of a single well also renders the data obtained from the duplicate sample well unusable. In addition, batch artifacts arise during the process of combining the wells that constitute a single sample.

The Luminex detector is analogous to a flow cytometer in that the instrument measures the fluorescent intensity of beads upon passage through a flow chamber. Alternatively, the detector may be a charge-coupled device. Generally, at least two fluorescence measurements are recorded from a maximum of 500 differentially colored bead sets. As a single analyte is usually attached to each differentially colored bead, the fluorescent counts may be used to uniquely identify individual analytes. In particular, the system assigns each bead detected in a sample to a set based on its color. A complete Luminex bead-set which may comprise these 500 differentially colored beads may be depicted using a three dimensional coordinate plot. It is generally believed that the number of differentially colored beads that may be accurately classified to a bead-color-region is limited by the overlapping spectral regimes of the different colors used. For example, a bead-color-region may include, but not be limited to, 500 beads each identified by a unique 3D coordinate using three classification laser measurements (CL1, CL2 and CL3). In addition to classifying the beads, the instrument records another fluorescence measurement known as a “reporter” for each bead. The “reporter” measurement is used to quantify the chemical reaction of interest and/or determine the presence or absence of an analyte (i.e., for example, mRNA).
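
As a rough illustration of assigning a bead to a bead-color-region from its three classification measurements, the following Python sketch picks the nearest region center in (CL1, CL2, CL3) space. The region centers, the distance cutoff, and the function name are hypothetical; the classification logic actually used by the instrument is not described here, and nearest-center assignment is only one plausible stand-in.

```python
import math


def classify_bead(cl_values, region_centers, max_distance):
    """Return the id of the nearest region center, or None if too far away.

    cl_values      -- (CL1, CL2, CL3) measurements for one bead
    region_centers -- dict mapping a region id to its (CL1, CL2, CL3) center
    max_distance   -- hypothetical cutoff beyond which a bead is unclassified
    """
    best_id, best_distance = None, float("inf")
    for region_id, center in region_centers.items():
        distance = math.dist(cl_values, center)  # Euclidean distance
        if distance < best_distance:
            best_id, best_distance = region_id, distance
    return best_id if best_distance <= max_distance else None


# Illustrative centers only; real region maps are instrument-specific.
centers = {"region_1": (10.0, 10.0, 10.0), "region_2": (50.0, 50.0, 50.0)}
assigned = classify_bead((12.0, 9.0, 11.0), centers, max_distance=5.0)
```

Beads falling between regions are left unclassified in this sketch, which mirrors the idea that overlapping spectral regimes limit how many regions can be told apart.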

Microfluidic devices have also been suggested to be used with methods where labeled microspheres (Luminex beads) would simultaneously detect multiple analytes in one of several sample chambers. These devices are constructed by a process known as multilayer soft lithography (MSL) that creates multilayer microfluidic systems by binding multiple patterned layers of elastomers. For example, the presence of the multi-layered microchannels allows delivery of a different labeled microparticle to a specific sample chamber where a different analyte is detected. Each microparticle is specifically functionalized to bind a particular analyte. Therefore, each microparticle in a given sample chamber is capable of analyzing an analyte different from the analyte for each other microparticle in the same sample chamber. As the delivery of each microsphere is independently controlled, labeled microspheres may be added to their respective sample chambers in different proportions, presumably to optimize the detection of each specific analyte (i.e., for example, to prevent and/or overcome sample signal saturation). Diercks et al., “Multiplexed, microfluidic molecular assay device and assay method” United States Patent Application 2007/0183934 (herein incorporated by reference).

Microspheres, such as Luminex beads, have been described as a platform to support the amplification of nucleic acids and production of proteins, in addition to the phototransfer from one substrate to another substrate. In particular, the microspheres may be spectrally encoded through incorporation of semiconductor nanocrystals (or SCNCs). A desired fluorescence characteristic may be obtained by mixing SCNCs of different sizes and/or compositions in a fixed amount and ratio to create a solution having a specific fluorescence spectrum. Therefore, a number of SCNC solutions may be prepared, each having a distinct distribution of semiconductor nanocrystal labeled microsphere size and composition, wherein each solution has a different fluorescence characteristic. Further, these solutions may be mixed in fixed proportions to arrive at a spectrum having predetermined ratios and intensities of emission from the distinct SCNCs suspended in that solution. Lim et al., “Methods for capturing nascent proteins” United States Patent Application 2010/0075374 (herein incorporated by reference).

Luminex bead systems have been described to improve the detection precision of a single analyte. A set of differently numbered microparticles (i.e., for example, belonging to different bead-sets or differential colors) are all coated with the same reagent so as to make them identical in sensitivity to the analyte being assayed. For example, an intra-assay titration curve may be constructed by coating the same fluorophore with different concentrations of labeled antibody, such that the same concentration of analyte is measured by detecting different signal magnitudes. Hanley B., “Intraplexing method for improving precision of suspended microarray assays” U.S. Pat. No. 7,501,290 (herein incorporated by reference).

The use of color coded beads has been described which may comprise nucleic acid capture moieties capable of ‘tandem hybridization’ with target nucleotides. Generally, a short capture probe is present on a color coded bead that binds a unique sequence of the target nucleic acid, while a longer labeled stacking probe has been preannealed to the target nucleic acid to facilitate subsequent detection. Each color coded bead therefore uniquely distinguishes between specific target nucleotides based upon the capture moiety nucleic acid sequence. Beattie et al., “Nucleic acid analysis using sequence-targeted tandem hybridization” U.S. Pat. No. 6,268,147 (herein incorporated by reference).

A solution-based method for determining the expression level of a population of labeled target nucleic acids has been developed that is based upon capturing the labeled target nucleic acids with color coded beads. Each bead is conjugated to a specific capture probe that binds to an individual labeled target nucleic acid. Usually, the capture probes are nucleic acids capable of hybridization to the labeled target nucleic acids such that their respective expression level may be determined within a biological sample. The method describes specific populations of target-specific bead sets, wherein each target-specific bead set is individually detectable and hybridizes to only one target nucleic acid. Specifically, the target-specific bead sets are described as having at least 5 individual bead sets that may bind with a corresponding set of target nucleic acids. As such, the bead population of a target-specific bead set may contain at least 100 individual beads that bind with a corresponding set of target nucleic acids. Golub et al., “Solution-based methods for RNA expression profiling” United States Patent Application 2007/0065844 (herein incorporated by reference).

In one embodiment, the present invention contemplates a solution-based method for highly multiplexed determination of populations of analyte levels present in a biological sample. For example, the population of target analytes may be a collection of individual target nucleic acids of interest, such as a member of a gene expression signature or just a particular gene of interest. Alternatively, the population of target analytes may be a collection of individual target proteins and/or peptides. Each individual target analyte of interest is conjugated to a detectable solid substrate (i.e., for example, a differentially colored bead) in a quantitative or semi-quantitative manner, such that the level of each target analyte may be measured using a detectable signal generated by the detectable solid substrate. The detectable signal of the detectable solid substrate is sometimes referred to as the target molecule signal or simply as the target signal. The method also involves a population of target-specific bead sets, where each target-specific bead set is individually detectable and has a capture probe which corresponds to an individual analyte. The population of analytes is attached in solution with the population of detectable solid substrates to form a solid substrate-analyte complex. To determine the level of the population of target analytes present, one detects the solid substrate signal for each solid-substrate-analyte complex, such that the level of the solid substrate signal indicates the level of the target analyte, and the location of the solid substrate signal within a multi-modal signal distribution pattern indicates the identity of the analyte being detected.

A limitation of existing bead-based systems is that, due to the relatively large microliter-scale volume of sample used per well, each analyte must be assayed with multiple beads of the same type to prevent signal saturation. Similar beads will compete with each other to bind to the same analyte. This situation decreases the sensitivity of the assay because the target analyte present in the sample is distributed over all of the beads specific for that analyte, and each bead will be reporting only a fraction of the analyte concentration. The mean value of the analyte concentration will, therefore, have a large standard error due to variable concentration values reported by each bead. The improvements of bead-based analyte detection described below make possible a highly accurate and sensitive, high capacity analyte detection system wherein an analyte may be detected using a single bead.

In one embodiment, the present invention contemplates a method which may comprise combining a plurality of 500 bead-set collections in a single well, wherein each collection interrogates a different set of 500 genes. In one embodiment, the method further may comprise detecting the plurality of 500 bead-set collections using the single well. In one embodiment, the method further may comprise generating a multi-modal fluorescent intensity distribution for each of the 500 bead color variations. Although it is not necessary to understand the mechanism of an invention, it is believed that the number of beads that support each multi-modal peak may be determined by determining the local height and width. In one embodiment, the method further may comprise comparing the number of beads within a specific multi-modal peak to the mixing proportion of a bead for a specific gene. In one embodiment, the multi-modal peak bead number matches the bead mixing proportion such that the specific analyte is identified.
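
The comparison of peak bead counts to mixing proportions may be sketched as follows. This is only an illustration under stated assumptions: the gene names, the 2:1 mixing proportion, and the rank-matching rule (the best supported peak goes to the gene mixed at the highest proportion) are hypothetical and are not taken from the text.

```python
def assign_peaks_to_genes(peaks, mixing_proportions):
    """Match peaks to genes by ranking bead support against mixing proportion.

    peaks              -- list of (peak_intensity, bead_count) for one bead color
    mixing_proportions -- dict mapping a gene to the proportion at which its
                          beads were mixed (hypothetical values)
    """
    if len(peaks) != len(mixing_proportions):
        raise ValueError("need exactly one detected peak per gene")
    ranked_peaks = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    ranked_genes = sorted(mixing_proportions,
                          key=mixing_proportions.get, reverse=True)
    return {gene: peak[0] for gene, peak in zip(ranked_genes, ranked_peaks)}


# Two genes share one bead color; GENE_A beads were mixed in at twice the
# amount of GENE_B beads (illustrative numbers only).
peaks = [(7.2, 21), (9.5, 44)]
expression = assign_peaks_to_genes(peaks, {"GENE_A": 2 / 3, "GENE_B": 1 / 3})
```

Using unequal mixing proportions is what makes the peaks distinguishable by bead count; with equal proportions this ranking would be arbitrary.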

As detailed above, the standard commercially available high capacity analyte detection systems are limited to simultaneously processing 500 analytes. While the ability to measure up to 500 analytes may be sufficient for many applications, this limitation is restrictive for most practical genomics applications. For example, in transcriptome-wide gene expression profiling, a practical assay requires simultaneous processing of many more than 500 genes.

One obvious approach to solve this problem would be to detect more than 500 analytes (i.e., for example, 1,000 genes) by using two wells per sample (i.e., for example, 500 genes per well × 2 wells). This technique would then assay a complete collection of 500 differentially dyed bead sets in both wells, where the bead sets in the first well are coupled to genes 1-500 and the bead sets in the second well are coupled to genes 501-1000. Consequently, equal aliquots of a biological sample are added to each well and detected separately. In order to determine the final result, the data from the two separate detections would have to be combined.

Several disadvantages are inherent in this approach including but not limited to: i) logistically cumbersome; ii) requires twice as much sample; iii) takes twice as much detection time; iv) loss of one well compromises both wells of data; or v) susceptible to batch artifacts which makes it difficult to re-constitute the whole sample.

In one embodiment, the present invention contemplates a method which may comprise interrogating multiple analytes, wherein said analytes are conjugated to individual, but identical, differentially colored beads. In one embodiment, a first analyte is conjugated to the individual, but identical, differentially colored bead that is selected from a first 500 bead-set. In one embodiment, a second analyte is conjugated to the individual, but identical, differentially colored variant that is selected from a second 500 bead-set.

The Luminex bead-level intensity data distributions suggested that expansion of the system's capacity might be possible by combining two collections of 500 bead-sets in a single well, wherein each 500 bead-set collection interrogates a different set of 500 genes. This approach would allow detection of a single sample in a single well. In some embodiments, various analytical methods are applied to the resulting bead level intensity data to obtain the correct identity for all 1,000 analytes.

Usually, colored bead intensities belonging to a particular bead set are summarized as a single value, wherein a median fluorescent intensity (MFI) is reported as the data point. For example, when the measured analytes are genes, the MFI of a particular bead set color represents the expression value of a particular gene. A significant disadvantage to the median-based algorithm is the presence of inaccuracy if the number of outliers is significantly large (e.g., if a number of beads have an intensity value close to zero), or where low bead counts could lead to misleading MFI values. For example, suggested Luminex data analysis methods ignore data wherein the bead count is less than thirty (30).
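
The two failure modes noted above may be made concrete with a short Python sketch. The thirty-bead cutoff is the figure quoted in the text; the function name and the intensity values are hypothetical and chosen only to show how a large fraction of near-zero beads drags the median away from the true signal.

```python
from statistics import median

MIN_BEAD_COUNT = 30  # cutoff quoted in the text for ignoring low-count data


def summarize_mfi(intensities, min_count=MIN_BEAD_COUNT):
    """Return the median intensity, or None when too few beads were detected."""
    if len(intensities) < min_count:
        return None
    return median(intensities)


# 29 beads: below the cutoff, so no value is reported.
too_few = summarize_mfi([1000.0] * 29)

# 40 beads, of which 25 read close to zero: the median follows the
# near-zero beads rather than the 15 beads carrying real signal.
skewed = summarize_mfi([1000.0] * 15 + [2.0] * 25)
```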

In addition to the MFI value, however, Luminex detectors also make available data for each individual bead (e.g., bead-level data). These data are stored in a standard flow cytometry data format (i.e., for example, an LXB file) and include information such as set membership and/or a fluorescent intensity of each individual bead that is detected. Certain embodiments of the present invention have taken advantage of this alternative data by developing a kernel density based intensity summarization method as an alternative to the default MFI summarization method. In a kernel density method, a smoothed Gaussian density estimate is first fit to the data. A peak detection algorithm then detects local maxima. The most prominent peak (defined as the peak which may comprise the highest bead count) is reported as the summary intensity value. Unlike the standard MFI algorithm, the kernel algorithm may also ignore spurious outliers and/or identify analytes with low bead counts for further consideration. For example, the data presented herein show the differences between intensity distributions for two analytes between MFI values and kernel density based measurements.

Detection and analysis of multimodal peaks have been discussed in relation to mass spectrometry analysis (Old et al., “Methods and systems for peak detection and quantitation” U.S. Pat. No. 7,279,679, herein incorporated by reference). However, some embodiments of the present invention provide significant improvements that provide superior detection of analytes.

In one embodiment, the present invention contemplates a method which may comprise detecting peaks from a multi-modal fluorescent intensity distribution using an algorithm. In one embodiment, the algorithm recovers an expression value of each gene interrogated with each bead color variation.

In one embodiment, the present invention contemplates a method for improving the accuracy of the peak detection algorithm. In one embodiment, the accuracy is improved by selecting paired genes. In one embodiment, the paired genes are frequently distant. Although it is not necessary to understand the mechanism of an invention, it is believed that a linear programming approach may be employed to maximize the pairwise distances across all genes.

Peak detection usually involves the identification of sufficient statistics comparing different populations from a multimodally distributed signal pattern. For example, the statistical analysis may identify two different populations from a bi-modal distribution signal pattern. Generally, a first step in peak identification involves assigning each data point (i.e., for example, a bead-level data point) to its most salient population. Once these data points have been mapped to their respective population, suitable statistics may be computed (i.e., for example, a median or mean) to summarize the values localized to a population of interest.

A kernel density method (KDM) may comprise a non-parametric method that does not make assumptions of the underlying distribution of the data. In general, the steps of the KDM algorithm may be performed in the following manner: i) log transform the data; ii) obtain a smoothed Gaussian kernel density estimate, wherein an optimal bandwidth for the kernel is chosen automatically; iii) detect local maxima by comparing each element of the smoothed estimate to its neighboring values, wherein an element that is larger than both of its neighbors is a local peak; iv) assign every data point to the nearest peak, wherein the support for a peak is the number of points that are assigned to it; and v) rank order the peaks according to the support.
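The five KDM steps may be sketched as follows. This is an illustrative sketch only: the function name `kernel_density_peaks`, the grid size, the use of a base-2 logarithm, and the normal-reference rule of thumb for the bandwidth are assumptions, since the text states only that the bandwidth is chosen automatically.

```python
import math
from statistics import stdev


def kernel_density_peaks(intensities, grid_size=200):
    """Return [(peak_location_log2, support), ...] ranked by support."""
    # i) log transform (non-positive intensities are dropped in this sketch)
    logs = [math.log2(v) for v in intensities if v > 0]
    n = len(logs)
    if n < 2:
        return [(v, 1) for v in logs]
    # ii) Gaussian kernel density estimate on a regular grid.
    # Bandwidth: normal-reference rule of thumb (an assumption of this sketch).
    h = 1.06 * stdev(logs) * n ** (-0.2)
    if h == 0:
        return [(logs[0], n)]
    lo, hi = min(logs) - 3 * h, max(logs) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + k * step for k in range(grid_size)]
    # Unnormalized density; only relative heights matter for peak finding.
    dens = [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in logs) for g in grid]
    # iii) a grid point larger than both neighbors is a local peak
    peaks = [grid[k] for k in range(1, grid_size - 1)
             if dens[k] > dens[k - 1] and dens[k] > dens[k + 1]]
    if not peaks:  # flat maximum: fall back to the highest grid point
        peaks = [grid[dens.index(max(dens))]]
    # iv) assign every data point to the nearest peak and count the support
    support = {p: 0 for p in peaks}
    for x in logs:
        nearest = min(peaks, key=lambda p: abs(p - x))
        support[nearest] += 1
    # v) rank order the peaks according to the support
    return sorted(support.items(), key=lambda item: item[1], reverse=True)


# Synthetic bead-level data: 60 beads near 2**10 and 20 beads near 2**6.
beads = ([2 ** (10 + 0.05 * (k % 7 - 3)) for k in range(60)]
         + [2 ** (6 + 0.05 * (k % 5 - 2)) for k in range(20)])
ranked = kernel_density_peaks(beads)
```

The first entry of `ranked` is the most prominent peak, i.e., the value that would be reported as the summary intensity in the kernel density method described above.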

Another method, the Gaussian mixture model (GMM), assumes that the signal is a mixture of two Gaussian populations. In general, the steps of the GMM algorithm may be performed in the following manner: i) log transform the data; and ii) assuming a mixture of two Gaussians, estimate the means μ, the variances σ, and the mixing proportion π, wherein φ_θ(y) is the normal density evaluated at y given θ={μ,σ}, as follows:

-   -   (a) Take an initial guess at θ₁={μ₁, σ₁}, θ₂={μ₂, σ₂}, and π.
-   -   (b) Compute the membership probability, δ_(i), of each data point y_(i) via

$\delta_{i} = \frac{\pi\,\varphi_{\theta_{2}}(y_{i})}{(1 - \pi)\,\varphi_{\theta_{1}}(y_{i}) + \pi\,\varphi_{\theta_{2}}(y_{i})}$

and update the mixing proportion as

$\pi = \frac{\sum_{i = 1}^{n}\delta_{i}}{n}$

-   -   (c) Update the parameter vectors θ₁ and θ₂ with the following update equations

$\bar{\mu}_{1} = \frac{\sum_{i = 1}^{n}(1 - \delta_{i})\,y_{i}}{\sum_{i = 1}^{n}(1 - \delta_{i})} \qquad \bar{\mu}_{2} = \frac{\sum_{i = 1}^{n}\delta_{i}\,y_{i}}{\sum_{i = 1}^{n}\delta_{i}}$

$\bar{\sigma}_{1} = \frac{\sum_{i = 1}^{n}(1 - \delta_{i})(y_{i} - \bar{\mu}_{1})^{2}}{\sum_{i = 1}^{n}(1 - \delta_{i})} \qquad \bar{\sigma}_{2} = \frac{\sum_{i = 1}^{n}\delta_{i}(y_{i} - \bar{\mu}_{2})^{2}}{\sum_{i = 1}^{n}\delta_{i}}$

-   -   (d) Return to step (b) until convergence.
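Steps (a) through (d) amount to an expectation-maximization loop, which may be sketched as follows. The function names, the initial guess (splitting the sorted data into halves), the variance floor, and the convergence tolerance are assumptions of this sketch and are not taken from the text; the heuristics mentioned below are not included. The variables `var1` and `var2` correspond to the variances written σ₁ and σ₂ above.

```python
import math


def normal_pdf(y, mu, var):
    """Normal density with mean mu and variance var, evaluated at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(y, max_iter=500, tol=1e-9, var_floor=1e-6):
    """Fit a two-component Gaussian mixture to log-transformed values y."""
    y = sorted(y)
    n = len(y)
    half = n // 2
    grand_mean = sum(y) / n
    # (a) initial guess: lower and upper halves of the sorted data
    mu1 = sum(y[:half]) / half
    mu2 = sum(y[half:]) / (n - half)
    var1 = var2 = max(sum((v - grand_mean) ** 2 for v in y) / n, var_floor)
    pi = 0.5
    for _ in range(max_iter):
        # (b) membership probability of each point in population 2
        delta = []
        for v in y:
            p1 = (1 - pi) * normal_pdf(v, mu1, var1)
            p2 = pi * normal_pdf(v, mu2, var2)
            delta.append(p2 / (p1 + p2))
        w2 = sum(delta)
        w1 = n - w2
        new_pi = w2 / n
        # (c) weighted updates of the means and variances
        new_mu1 = sum((1 - d) * v for d, v in zip(delta, y)) / w1
        new_mu2 = sum(d * v for d, v in zip(delta, y)) / w2
        new_var1 = max(sum((1 - d) * (v - new_mu1) ** 2
                           for d, v in zip(delta, y)) / w1, var_floor)
        new_var2 = max(sum(d * (v - new_mu2) ** 2
                           for d, v in zip(delta, y)) / w2, var_floor)
        change = max(abs(new_mu1 - mu1), abs(new_mu2 - mu2), abs(new_pi - pi))
        mu1, mu2, var1, var2, pi = new_mu1, new_mu2, new_var1, new_var2, new_pi
        # (d) repeat until the parameters stop changing
        if change < tol:
            break
    return {"mu1": mu1, "var1": var1, "mu2": mu2, "var2": var2, "pi": pi}


# Synthetic log2 intensities: 60 values near 10 and 20 values near 6.
values = ([10 + 0.05 * (k % 7 - 3) for k in range(60)]
          + [6 + 0.05 * (k % 5 - 2) for k in range(20)])
fit = fit_two_gaussians(values)
```

In this sketch π is the proportion of points attributed to the second (higher-intensity) population, matching the definition of δ_(i) in step (b).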

It should be noted that a GMM parameter estimation may be sensitive to non-Gaussian components of the signal. Consequently, exploratory data analysis has resulted in a definition of a set of heuristics coupled with GMM estimation, which produce accurate peak calls. For example, the data presented herein show an example output of the GMM for a single analyte measured using the dual tag approach.

In one embodiment, the present invention contemplates a peak detection algorithm which may further comprise a strategy to select paired genes for conjugation to individual, but identical, differentially colored beads. In one embodiment, the paired genes are frequently distant. For example, a linear programming approach is used to maximize the pairwise distances across all genes. The optimization problem may be stated as:

Maximize:

$\sum\limits_{i}^{M}{\sum\limits_{i}^{M}{{d( {i,j} )}{x( {i,j} )}}}$

Where d(I,j) is the pairwise distance between the ith and jth gene. x isa symmetric binary matrix whose x(i,j)=1 is the ith and jth gene arepaired.

Subject to:

${\sum\limits_{i}^{M}{x( {i,j} )}} = 1$${\sum\limits_{j}^{M}{x( {i,j} )}} = 1$ x(i, j) = x(j, i)

In one embodiment, the present invention contemplates a peak detection algorithm which may further comprise a strategy for circumstances where it is difficult to achieve exact mixing proportions of beads, wherein the actual bead counts are measured and then employed as priors within the peak assignment algorithm.

In one embodiment, the detected peak signal may be improved by conjugating every member of an analyte set to the same differentially colored bead. Although it is not necessary to understand the mechanism of an invention, it is believed that multiple analytes on the same bead color will increase the signal-to-noise ratio.

Once peaks within a multimodal distribution pattern have been detected, the peaks need unambiguous assignment to specific genes. In one embodiment, the present invention contemplates a method for unambiguous gene assignment which may comprise combining a plurality of bead-set collections, wherein each differentially colored bead is present in an unequal proportion between each bead-set collection. In one embodiment, a first differentially colored bead may be present in a proportion that is 1.25 times the standard volume selected from a first bead-set collection, while a second differentially colored bead, that is identical to the first differentially colored bead, may be present in a proportion that is 0.75 times the standard volume selected from a second bead-set collection. Then, by examining the support for each peak (e.g., peak height, neighboring bead count or mixing proportion) and using the prior knowledge of the mixing proportion of a bead for a specific gene, an unambiguous assignment for each gene is made.
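The assignment logic may be sketched for the two-gene case as follows. The function name `assign_peaks`, the gene labels, and the peak values are hypothetical; the sketch assumes that exactly two peaks were detected and uses only the bead-count support of each peak.

```python
def assign_peaks(peaks, gene_high, gene_low):
    """Assign two detected peaks to two genes sharing one bead color.

    peaks: list of two (intensity, support) tuples.
    gene_high: gene whose beads were mixed at the larger proportion (1.25x).
    gene_low: gene whose beads were mixed at the smaller proportion (0.75x).
    """
    if len(peaks) != 2:
        raise ValueError("this sketch handles exactly two peaks")
    # The peak with more supporting beads is attributed to the gene
    # that was mixed at the higher proportion.
    major, minor = sorted(peaks, key=lambda p: p[1], reverse=True)
    return {gene_high: major[0], gene_low: minor[0]}


# Hypothetical peaks as (log2 intensity, bead count).
calls = assign_peaks([(6.0, 20), (10.0, 33)], "GENE_A", "GENE_B")
print(calls)
```

With a 1.25 to 0.75 mix, roughly 62.5% of the beads of a given color are expected to support the peak of the first gene, which is what makes the assignment unambiguous when the bead counts are adequate.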

mRNA expression may be measured by any suitable method, including but not limited to, those disclosed below.

In some embodiments, RNA is detected by Northern blot analysis. Northern blot analysis involves the separation of RNA and hybridization of a complementary labeled probe.

In other embodiments, RNA expression is detected by enzymatic cleavage of specific structures (INVADER assay, Third Wave Technologies; See e.g., U.S. Pat. Nos. 5,846,717, 6,090,543; 6,001,567; 5,985,557; and 5,994,069; each of which is herein incorporated by reference). The INVADER assay detects specific nucleic acid (e.g., RNA) sequences by using structure-specific enzymes to cleave a complex formed by the hybridization of overlapping oligonucleotide probes.

In still further embodiments, RNA (or corresponding cDNA) is detected by hybridization to an oligonucleotide probe. A variety of hybridization assays using a variety of technologies for hybridization and detection are available. For example, in some embodiments, the TaqMan assay (PE Biosystems, Foster City, Calif.; See e.g., U.S. Pat. Nos. 5,962,233 and 5,538,848, each of which is herein incorporated by reference) is utilized. The assay is performed during a PCR reaction. The TaqMan assay exploits the 5′-3′ exonuclease activity of the AMPLITAQ GOLD DNA polymerase. A probe consisting of an oligonucleotide with a 5′-reporter dye (e.g., a fluorescent dye) and a 3′-quencher dye is included in the PCR reaction. During PCR, if the probe is bound to its target, the 5′-3′ nucleolytic activity of the AMPLITAQ GOLD polymerase cleaves the probe between the reporter and the quencher dye. The separation of the reporter dye from the quencher dye results in an increase of fluorescence. The signal accumulates with each cycle of PCR and may be monitored with a fluorimeter.

In yet other embodiments, reverse-transcriptase PCR (RT-PCR) is used to detect the expression of RNA. In RT-PCR, RNA is enzymatically converted to complementary DNA or “cDNA” using a reverse transcriptase enzyme. The cDNA is then used as a template for a PCR reaction. PCR products may be detected by any suitable method, including but not limited to, gel electrophoresis and staining with a DNA specific stain or hybridization to a labeled probe. In some embodiments, the quantitative reverse transcriptase PCR with standardized mixtures of competitive templates method described in U.S. Pat. Nos. 5,639,606, 5,643,765, and 5,876,978 (each of which is herein incorporated by reference) is utilized.

The method most commonly used as the basis for nucleic acid sequencing, or for identifying a target base, is the enzymatic chain-termination method of Sanger. Traditionally, such methods relied on gel electrophoresis to resolve, according to their size, nucleic acid fragments produced from a larger nucleic acid segment. However, in recent years various sequencing technologies have evolved which rely on a range of different detection strategies, such as mass spectrometry and array technologies.

One class of sequencing methods assuming importance in the art are those which rely upon the detection of PPi release as the detection strategy. It has been found that such methods lend themselves admirably to large scale genomic projects or clinical sequencing or screening, where relatively cost-effective units with high throughput are needed.

Methods of sequencing based on the concept of detecting inorganic pyrophosphate (PPi) which is released during a polymerase reaction have been described in the literature (for example, WO 93/23564, WO 89/09283, WO 98/13523 and WO 98/28440). As each nucleotide is added to a growing nucleic acid strand during a polymerase reaction, a pyrophosphate molecule is released. It has been found that pyrophosphate released under these conditions may readily be detected, for example enzymically, e.g. by the generation of light in the luciferase-luciferin reaction. Such methods enable a base to be identified in a target position and DNA to be sequenced simply and rapidly whilst avoiding the need for electrophoresis and the use of labels.

At its most basic, a PPi-based sequencing reaction involves simply carrying out a primer-directed polymerase extension reaction in the presence of a nucleotide, and detecting whether or not that nucleotide has been incorporated by detecting whether or not PPi has been released. Conveniently, this detection of PPi-release may be achieved enzymatically, and most conveniently by means of a luciferase-based light detection reaction termed ELIDA (see further below).

It has been found that dATP added as a nucleotide for incorporation interferes with the luciferase reaction used for PPi detection. Accordingly, a major improvement to the basic PPi-based sequencing method has been to use, in place of dATP, a dATP analogue (specifically dATPαS) which is incapable of acting as a substrate for luciferase, but which is nonetheless capable of being incorporated into a nucleotide chain by a polymerase enzyme (WO 98/13523).

Further improvements to the basic PPi-based sequencing technique include the use of a nucleotide degrading enzyme such as apyrase during the polymerase step, so that unincorporated nucleotides are degraded, as described in WO 98/28440, and the use of a single-stranded nucleic acid binding protein in the reaction mixture after annealing of the primers to the template, which has been found to have a beneficial effect in reducing the number of false signals, as described in WO 00/43540.

In other embodiments, gene expression may be detected by measuring the expression of a protein or polypeptide. Protein expression may be detected by any suitable method. In some embodiments, proteins are detected by immunohistochemistry. In other embodiments, proteins are detected by their binding to an antibody raised against the protein. The generation of antibodies is described below.

Antibody binding may be detected by many different techniques including, but not limited to, radioimmunoassay, ELISA (enzyme-linked immunosorbent assay), “sandwich” immunoassays, immunoradiometric assays, gel diffusion precipitation reactions, immunodiffusion assays, in situ immunoassays (e.g., using colloidal gold, enzyme or radioisotope labels), Western blots, precipitation reactions, agglutination assays (e.g., gel agglutination assays, hemagglutination assays, etc.), complement fixation assays, immunofluorescence assays, protein A assays, and immunoelectrophoresis assays.

In one embodiment, antibody binding is detected by detecting a label on the primary antibody. In another embodiment, the primary antibody is detected by detecting binding of a secondary antibody or reagent to the primary antibody. In a further embodiment, the secondary antibody is labeled.

In some embodiments, an automated detection assay is utilized. Methods for the automation of immunoassays include those described in U.S. Pat. Nos. 5,885,530, 4,981,785, 6,159,750, and 5,358,691, each of which is herein incorporated by reference. In some embodiments, the analysis and presentation of results is also automated. For example, in some embodiments, software that generates a prognosis based on the presence or absence of a series of proteins corresponding to cancer markers is utilized.

In other embodiments, the immunoassay described in U.S. Pat. Nos. 5,599,677 and 5,672,480, each of which is herein incorporated by reference, is utilized.

In some embodiments, a computer-based analysis program is used to translate the raw data generated by the detection assay (e.g., the presence, absence, or amount of a given marker or markers) into data of predictive value for a clinician. The clinician may access the predictive data using any suitable means. Thus, in some preferred embodiments, the present invention provides the further benefit that the clinician, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. The data is presented directly to the clinician in its most useful form. The clinician is then able to immediately utilize the information in order to optimize the care of the subject.

The present invention contemplates any method capable of receiving, processing, and transmitting the information to and from laboratories conducting the assays, wherein the information is provided to medical personnel and/or subjects. For example, in some embodiments of the present invention, a sample (e.g., a biopsy or a serum or urine sample) is obtained from a subject and submitted to a profiling service (e.g., clinical lab at a medical facility, genomic profiling business, etc.), located in any part of the world (e.g., in a country different than the country where the subject resides or where the information is ultimately used) to generate raw data. Where the sample may comprise a tissue or other biological sample, the subject may visit a medical center to have the sample obtained and sent to the profiling center, or subjects may collect the sample themselves (e.g., a urine sample) and directly send it to a profiling center. Where the sample may comprise previously determined biological information, the information may be directly sent to the profiling service by the subject (e.g., an information card containing the information may be scanned by a computer and the data transmitted to a computer of the profiling center using an electronic communication system). Once received by the profiling service, the sample is processed and a profile is produced (i.e., expression data), specific for the diagnostic or prognostic information desired for the subject.

The profile data is then prepared in a format suitable for interpretation by a treating clinician. For example, rather than providing raw expression data, the prepared format may represent a diagnosis or risk assessment for the subject, along with recommendations for particular treatment options. The data may be displayed to the clinician by any suitable method. For example, in some embodiments, the profiling service generates a report that may be printed for the clinician (e.g., at the point of care) or displayed to the clinician on a computer monitor.

In some embodiments, the information is first analyzed at the point of care or at a regional facility. The raw data is then sent to a central processing facility for further analysis and/or to convert the raw data to information useful for a clinician or patient. The central processing facility provides the advantage of privacy (all data is stored in a central facility with uniform security protocols), speed, and uniformity of data analysis. The central processing facility may then control the fate of the data following treatment of the subject. For example, using an electronic communication system, the central facility may provide data to the clinician, the subject, or researchers.

In some embodiments, the subject is able to directly access the data using the electronic communication system. The subject may choose further intervention or counseling based on the results. In some embodiments, the data is used for research use. For example, the data may be used to further optimize the inclusion or elimination of markers as useful indicators of a particular condition or stage of disease.

In one embodiment, the present invention contemplates kits for the practice of the methods of this invention. The kits preferably include one or more containers containing various compositions and/or reagents to perform methods of this invention. The kit may optionally include a plurality of cluster centroid landmark transcripts. The kit may optionally include a plurality of nucleic-acid sequences wherein the sequence is complementary to at least a portion of a cluster centroid landmark transcript sequence, and wherein the sequences may optionally comprise a primer sequence and/or a barcode nucleic-acid sequence. The kit may optionally include a plurality of optically addressed beads, wherein each bead may comprise a different nucleic-acid sequence that is complementary to a barcode nucleic-acid sequence.

The kit may optionally include enzymes capable of performing PCR (i.e., for example, DNA polymerase, thermostable polymerase). The kit may optionally include enzymes capable of performing nucleic-acid ligation (for example, a ligase). The kit may optionally include buffers, excipients, diluents, biochemicals and/or other enzymes or proteins. The kits may also optionally include appropriate systems (e.g., opaque containers) or stabilizers (e.g., antioxidants) to prevent degradation of the reagents by light or other adverse conditions.

The kits may optionally include instructional materials containing directions (i.e., protocols) providing for the use of the reagents in the performance of any method described herein. While the instructional materials typically comprise written or printed materials, they are not limited to such. Any medium capable of storing such instructions and communicating them to an end user is contemplated by this invention. Such media include, but are not limited to, electronic storage media (e.g., magnetic discs, tapes, cartridges, chips), optical media (e.g., CD ROM), and the like. Such media may include addresses to internet sites that provide such instructional materials.

The kits may optionally include computer software (i.e., algorithms, formulae, instrument settings, instructions for robots, etc.) providing for the performance of any method described herein, simplification or automation of any method described herein, or manipulation, analysis, display or visualization of data generated thereby. Any medium capable of storing such software and conveying it to an end user is contemplated by this invention. Such media include, but are not limited to, electronic storage media (e.g., magnetic discs), optical media (e.g., CD ROM), and the like. Such media may include addresses to internet sites that provide such software.

In other embodiments, the present invention provides kits for the detection and characterization of proteins and/or nucleic acids. In some embodiments, the kits contain antibodies specific for a protein expressed from a gene of interest, in addition to detection reagents and buffers. In other embodiments, the kits contain reagents specific for the detection of mRNA or cDNA (e.g., oligonucleotide probes or primers). In preferred embodiments, the kits contain all of the components necessary to perform a detection assay, including all controls, directions for performing assays, and any necessary software for analysis and presentation of results.

Samples (i.e., for example, biological samples) may be optionally concentrated using a commercially available concentration filter, for example, an Amicon or Millipore Pellicon ultrafiltration unit. Following the concentration step, the concentrate may be applied to a suitable purification matrix as previously described. For example, a suitable affinity matrix may comprise a ligand or antibody molecule bound to a suitable support. Alternatively, an anion exchange resin may be employed, for example, a matrix or substrate having pendant diethylaminoethyl (DEAE) groups. The matrices may be acrylamide, agarose, dextran, cellulose or other types commonly employed in protein purification. Alternatively, a cation exchange step may be employed. Suitable cation exchangers include various insoluble matrices which may comprise sulfopropyl or carboxymethyl groups. Sulfopropyl groups are preferred.

Finally, one or more reversed-phase high performance liquid chromatography (RP-HPLC) steps employing hydrophobic RP-HPLC media, e.g., silica gel having pendant methyl or other aliphatic groups, may be employed to further purify an IL-1R composition. Some or all of the foregoing purification steps, in various combinations, may also be employed to provide a substantially pure recombinant protein.

Protein may be isolated by initial extraction from cell pellets, followed by one or more concentration, salting-out, hydrophobic interaction chromatography (HIC), aqueous ion exchange or size exclusion chromatography steps. Finally, high performance liquid chromatography (HPLC) may be employed for final purification steps. Most biological cells may be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents.

The present invention provides isolated antibodies (i.e., for example, polyclonal or monoclonal). In one embodiment, the present invention provides antibodies that specifically bind to a subset of a solid particle population. These antibodies find use in the detection methods described above.

An antibody against a protein of the present invention may be any monoclonal or polyclonal antibody, as long as it may recognize the protein. Antibodies may be produced by using a protein of the present invention as the antigen according to a conventional antibody or antiserum preparation process.

The present invention contemplates the use of both monoclonal and polyclonal antibodies. Any suitable method may be used to generate the antibodies used in the methods and compositions of the present invention, including but not limited to, those disclosed herein. For example, for preparation of a monoclonal antibody, protein, as such, or together with a suitable carrier or diluent is administered to an animal (e.g., a mammal) under conditions that permit the production of antibodies. For enhancing the antibody production capability, complete or incomplete Freund's adjuvant may be administered. Normally, the protein is administered once every 2 weeks to 6 weeks, in total, about 2 times to about 10 times. Animals suitable for use in such methods include, but are not limited to, primates, rabbits, dogs, guinea pigs, mice, rats, sheep, goats, etc.

For preparing monoclonal antibody-producing cells, an individual animal whose antibody titer has been confirmed (e.g., a mouse) is selected, and 2 days to 5 days after the final immunization, its spleen or lymph node is harvested and antibody-producing cells contained therein are fused with myeloma cells to prepare the desired monoclonal antibody producer hybridoma. Measurement of the antibody titer in antiserum may be carried out, for example, by reacting the labeled protein, as described hereinafter, and antiserum, and then measuring the activity of the labeling agent bound to the antibody. The cell fusion may be carried out according to known methods, for example, the method described by Koehler and Milstein (Nature 256:495 [1975]). As a fusion promoter, for example, polyethylene glycol (PEG) or Sendai virus (HVJ), preferably PEG, is used.

Examples of myeloma cells include NS-1, P3U1, SP2/0, AP-1 and the like. The proportion of the number of antibody producer cells (spleen cells) and the number of myeloma cells to be used is preferably about 1:1 to about 20:1. PEG (preferably PEG 1000 to PEG 6000) is preferably added in a concentration of about 10% to about 80%. Cell fusion may be carried out efficiently by incubating a mixture of both cells at about 20° C. to about 40° C., preferably about 30° C. to about 37° C., for about 1 minute to 10 minutes.

Various methods may be used for screening for a hybridoma producing the antibody (e.g., against a tumor antigen or autoantibody of the present invention). For example, a supernatant of the hybridoma is added to a solid phase (e.g., microplate) to which antibody is adsorbed directly or together with a carrier, and then an anti-immunoglobulin antibody (if mouse cells are used in cell fusion, anti-mouse immunoglobulin antibody is used) or Protein A labeled with a radioactive substance or an enzyme is added to detect the monoclonal antibody against the protein bound to the solid phase. Alternately, a supernatant of the hybridoma is added to a solid phase to which an anti-immunoglobulin antibody or Protein A is adsorbed, and then the protein labeled with a radioactive substance or an enzyme is added to detect the monoclonal antibody against the protein bound to the solid phase.

Selection of the monoclonal antibody may be carried out according to any known method or its modification. Normally, a medium for animal cells to which HAT (hypoxanthine, aminopterin, thymidine) is added is employed. Any selection and growth medium may be employed as long as the hybridoma may grow. For example, RPMI 1640 medium containing 1% to 20%, preferably 10% to 20%, fetal bovine serum, GIT medium containing 1% to 10% fetal bovine serum, a serum free medium for cultivation of a hybridoma (SFM-101, Nissui Seiyaku) and the like may be used. Normally, the cultivation is carried out at 20° C. to 40° C., preferably 37° C., for about 5 days to 3 weeks, preferably 1 week to 2 weeks, under about 5% CO2 gas. The antibody titer of the supernatant of a hybridoma culture may be measured according to the same manner as described above with respect to the antibody titer of the anti-protein in the antiserum.

Separation and purification of a monoclonal antibody may be carried out according to the same manner as those of conventional polyclonal antibodies such as separation and purification of immunoglobulins, for example, salting-out, alcoholic precipitation, isoelectric point precipitation, electrophoresis, adsorption and desorption with ion exchangers (e.g., DEAE), ultracentrifugation, gel filtration, or a specific purification method wherein only an antibody is collected with an active adsorbent such as an antigen-binding solid phase, Protein A or Protein G and dissociating the binding to obtain the antibody.

Polyclonal antibodies may be prepared by any known method or modifications of these methods including obtaining antibodies from patients. For example, a complex of an immunogen (an antigen against the protein) and a carrier protein is prepared and an animal is immunized by the complex according to the same manner as that described with respect to the above monoclonal antibody preparation. A material containing the antibody against the protein is recovered from the immunized animal and the antibody is separated and purified.

As to the complex of the immunogen and the carrier protein to be used for immunization of an animal, any carrier protein and any mixing proportion of the carrier and a hapten may be employed as long as an antibody against the hapten, which is crosslinked on the carrier and used for immunization, is produced efficiently. For example, bovine serum albumin, bovine cycloglobulin, keyhole limpet hemocyanin, etc. may be coupled to a hapten in a weight ratio of about 0.1 part to about 20 parts, preferably, about 1 part to about 5 parts per 1 part of the hapten.

In addition, various condensing agents may be used for coupling of a hapten and a carrier. For example, glutaraldehyde, carbodiimide, maleimide activated ester, activated ester reagents containing thiol group or dithiopyridyl group, and the like find use with the present invention. The condensation product as such or together with a suitable carrier or diluent is administered to a site of an animal that permits the antibody production. For enhancing the antibody production capability, complete or incomplete Freund's adjuvant may be administered. Normally, the protein is administered once every 2 weeks to 6 weeks, in total, about 3 times to about 10 times.

The polyclonal antibody is recovered from blood, ascites and the like, of an animal immunized by the above method. The antibody titer in the antiserum may be measured according to the same manner as that described above with respect to the supernatant of the hybridoma culture. Separation and purification of the antibody may be carried out according to the same separation and purification method of immunoglobulin as that described with respect to the above monoclonal antibody.

The protein used herein as the immunogen is not limited to any particular type of immunogen. For example, a protein expressed resulting from a virus infection (further including a gene having a nucleotide sequence partly altered) may be used as the immunogen. Further, fragments of the protein may be used. Fragments may be obtained by any methods including, but not limited to, expressing a fragment of the gene, enzymatic processing of the protein, chemical synthesis, and the like.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations may be made herein without departing from the spirit and scope of the invention as defined in the appended claims.

The present invention will be further illustrated in the following Examples which are given for illustration purposes only and are not intended to limit the invention in any way.

EXAMPLES

Example 1

Identification of Cluster Centroid Landmark Transcripts and Creation of a Dependency Matrix

The present example describes one method for the identification of cluster centroid landmark transcripts having inferential relationships.

Thirty-five thousand eight hundred and sixty-seven transcriptome-wide gene-expression profiles generated with the Affymetrix U133 family of oligonucleotide microarrays were downloaded from NCBI's Gene Expression Omnibus (GEO) repository in the form of .cel files. The .cel files were preprocessed to produce average-difference values (i.e., expression levels) for each probe set using MAS5 (Affymetrix). Expression levels in each profile were then scaled with respect to the expression levels of 350 previously-determined invariant probe sets whose expression levels together spanned the range of expression levels observed. The minimal common feature space in the dataset was determined to be 22,268 probe sets.

The quality of each profile was assessed by reference to two data-quality metrics: percentage of P-calls and 3′:5′ ratios. Empirical distributions of both metrics were built and the 10% of profiles at both extremes of each distribution were eliminated from further consideration. A total of 16,428 profiles remained after this quality filtering. A further 1,941 profiles were found to be from a single source, and were also eliminated, leaving 14,487 profiles.
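By way of illustration only, the two-sided trimming described above may be sketched as follows. The function name `quality_filter`, the use of NumPy, and the randomly generated metric values are assumptions made for this sketch and are not taken from the original analysis; the sketch shows only the trimming of the lowest and highest 10% of each metric.

```python
import numpy as np

def quality_filter(p_call_pct, ratio_3_5, tail=0.10):
    """Return a boolean mask of profiles kept after trimming both tails of each metric.

    p_call_pct, ratio_3_5: one value per profile (hypothetical inputs).
    tail: fraction removed at each extreme of each empirical distribution.
    """
    p_call_pct = np.asarray(p_call_pct, dtype=float)
    ratio_3_5 = np.asarray(ratio_3_5, dtype=float)
    keep = np.ones(p_call_pct.shape[0], dtype=bool)
    for metric in (p_call_pct, ratio_3_5):
        lo, hi = np.quantile(metric, [tail, 1.0 - tail])
        # A profile survives only if it lies inside the central band of this metric.
        keep &= (metric >= lo) & (metric <= hi)
    return keep

# Synthetic metric values, for demonstration only.
rng = np.random.default_rng(0)
p_calls = rng.normal(45.0, 8.0, size=1000)
ratios = rng.lognormal(0.3, 0.4, size=1000)
mask = quality_filter(p_calls, ratios)
print(int(mask.sum()), "of", mask.size, "profiles retained")
```

Because a profile is removed when it falls in the tail of either metric, the fraction retained depends on how strongly the two metrics are correlated.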

Probe sets below a predetermined arbitrary detection threshold of 20 average-difference units in over 99% of the profiles were eliminated, bringing the total number of probe sets under consideration to 14,812.
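A minimal sketch of such a detection filter is given below, assuming the expression data are held as a profiles-by-probe-sets NumPy array. The function name `detectable_probe_sets` and the toy matrix are hypothetical and serve only to illustrate the rule stated above.

```python
import numpy as np

def detectable_probe_sets(expr, threshold=20.0, max_fraction_below=0.99):
    """Return a boolean mask over probe sets (columns) that are retained.

    expr: array of shape (profiles, probe sets) of average-difference values.
    A probe set is eliminated when it is below `threshold` in more than
    `max_fraction_below` of the profiles.
    """
    expr = np.asarray(expr, dtype=float)
    fraction_below = (expr < threshold).mean(axis=0)
    return fraction_below <= max_fraction_below

# Toy data: 200 profiles x 3 probe sets.
expr = np.full((200, 3), 5.0)
expr[:3, 1] = 100.0   # detected in 3 of 200 profiles (below threshold in 98.5%)
expr[:, 2] = 100.0    # detected in every profile
keep = detectable_probe_sets(expr)
print(keep.tolist())  # [False, True, True]
```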

Principal component analysis (PCA) dimensionality reduction was then applied to the dataset (i.e. 14,487 samples × 14,812 features). Two hundred eighty-seven components were identified that explained 90% of the variation in the dataset. The matrix of the PCA loadings of the features in the eigenspace (i.e. 287 × 14,812) was then clustered using k-means. Because k-means partitions such a high-dimensionality matrix non-deterministically, depending on the starting seeds, the clustering was repeated a number of times and the results were used to build a gene-by-gene pairwise consensus matrix.
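The following sketch illustrates, on a small synthetic matrix, how PCA loadings may be clustered repeatedly with different starting seeds and the results summarized in a pairwise consensus matrix. It is written with NumPy only; the plain Lloyd's k-means, the scaling of the loadings by singular value, the number of runs, and all function names are assumptions of this sketch, not details of the original analysis.

```python
import numpy as np

def pca_loadings(X, var_explained=0.90):
    """X: samples x features. Return a (components x features) loading matrix
    with just enough components to reach the requested explained variance."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    n_comp = int(np.searchsorted(ratio, var_explained) + 1)
    return vt[:n_comp] * s[:n_comp, None]

def kmeans_labels(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means from random starting centers; one label per point."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def consensus_matrix(points, k, n_runs=20, seed=0):
    """Fraction of k-means runs in which each pair of points shares a cluster."""
    rng = np.random.default_rng(seed)
    n = len(points)
    together = np.zeros((n, n))
    for _ in range(n_runs):
        labels = kmeans_labels(points, k, rng)
        together += labels[:, None] == labels[None, :]
    return together / n_runs

# Synthetic data: 60 samples, 12 features driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 3))
X = np.repeat(latent, 4, axis=1) + 0.05 * rng.normal(size=(60, 12))
loadings = pca_loadings(X)                # components x features
C = consensus_matrix(loadings.T, k=3)     # features are the points being clustered
print(loadings.shape, C.shape)
```

Each entry of `C` lies between 0 and 1, and a matrix of this form is suitable as the similarity input to a subsequent hierarchical clustering step.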

Pockets of high local correlation were identified by hierarchically clustering the gene-by-gene pairwise consensus matrix. The leaves on each node of the dendrogram ‘tree’ together constitute a cluster. The tree was then cut at multiple levels to identify 100, 300, 500, 700, 1,000, 1,500, 2,000, 5,000, and 10,000 clusters.

For each cluster, the probe set whose individual expression-level vector across all 14,487 profiles most closely correlated with the mean of all probe sets in that cluster was selected as the centroid of that cluster. This produced sets of 100, 300, 500, 700, 1,000, 1,500, 2,000, 5,000, and 10,000 centroid probe sets. Multiple individual probe sets had attributes that approximate the definition of a centroid probe set of any given cluster.
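The centroid selection described above may be sketched as follows, assuming a profiles-by-probe-sets expression array and a vector of cluster labels. The function name `select_centroids`, the use of Pearson correlation via `numpy.corrcoef`, and the toy data are assumptions of this illustration.

```python
import numpy as np

def select_centroids(expr, labels):
    """Pick, for each cluster, the probe set best correlated with the cluster mean.

    expr: array of shape (profiles, probe sets).
    labels: cluster identifier for each probe set (column).
    Returns a dict mapping cluster id to the column index of its centroid.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    centroids = {}
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        mean_vec = expr[:, members].mean(axis=1)
        corr = [np.corrcoef(expr[:, m], mean_vec)[0, 1] for m in members]
        centroids[int(c)] = int(members[int(np.argmax(corr))])
    return centroids

# Toy data: column 1 is the average of columns 0 and 2, so it matches the
# mean of cluster 0 exactly; column 3 forms a cluster on its own.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
c = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
b = (a + c) / 2.0
d = np.array([2.0, 4.0, 1.0, 3.0, 5.0])
expr = np.column_stack([a, b, c, d])
centroids = select_centroids(expr, labels=[0, 0, 0, 1])
print(centroids)  # {0: 1, 1: 3}
```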

A dependency matrix was created for each set of centroid probe sets by linear regression between the expression levels of the g centroid probe sets and the remaining 14,812 - g probe sets in the space of the 14,487 profiles. A pseudo-inverse was used because the number of profiles did not necessarily match the number of features being modeled. Dependency matrices were thereby populated with weights (i.e. factors) relating the expression level of each non-centroid probe set to the expression level of each centroid probe set.
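One way such a regression may be expressed is shown in the sketch below, which uses the Moore-Penrose pseudo-inverse from NumPy. The absence of an intercept term, the function name `fit_dependency_matrix`, and the synthetic data are assumptions of this illustration rather than details of the original computation.

```python
import numpy as np

def fit_dependency_matrix(landmarks, targets):
    """Least-squares weights W such that targets is approximated by landmarks @ W.

    landmarks: (profiles, g) expression levels of the centroid probe sets.
    targets:   (profiles, m) expression levels of the non-centroid probe sets.
    Returns W of shape (g, m); column j holds the weights for target j.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # The pseudo-inverse handles the non-square (profiles != g) case.
    return np.linalg.pinv(landmarks) @ targets

# Synthetic example in which the targets are exact linear combinations.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(50, 3))
true_w = np.array([[1.0, 0.0, 0.5, -1.0, 2.0],
                   [0.0, 1.0, 0.5, 0.25, -0.5],
                   [0.5, -0.5, 0.0, 1.0, 1.0]])
targets = landmarks @ true_w
W = fit_dependency_matrix(landmarks, targets)
print(np.round(W, 3))
```

With real data the targets are not exact linear combinations of the landmarks, so the fitted weights give the least-squares approximation rather than an exact recovery.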

The identity and gene symbol of the transcript represented by each centroid probe set was determined using a mapping provided by Affymetrix (affymetrix.com) and taken as a ‘cluster centroid landmark transcript.’ Non-centroid probe sets were mapped to gene symbols in the same manner.

Example II Determining a Suitable Number of Cluster Centroid Landmark Transcripts

The present example describes one method for selecting the number of cluster centroid landmark transcripts required to create useful transcriptome-wide gene-expression profiles. This method makes use of a large collection of transcriptome-wide gene-expression profiles, produced with Affymetrix oligonucleotide microarrays from cultured human cells treated with small-molecule perturbagens, provided in build02 of the public Connectivity Map resource (broadinstitute.org/cmap). One use of Connectivity Map is the identification of similarities between the biological effects of small-molecule perturbagens. This is achieved by detecting similarities in the gene-expression profiles produced by treating cells with those perturbagens (Lamb et al., “The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease” Science 313:1929 2006), and represents one valuable application of transcriptome-wide gene-expression profiling. In summary of the present method, expression values for the sets of cluster centroid landmark transcripts (specifically their corresponding probe sets) identified according to Example I (above) are extracted from the Connectivity Map data and used to create transcriptome-wide gene-expression profiles using the dependency matrices also generated according to Example I (above). Note that the collection of expression profiles used in Example I did not include any Connectivity Map data. The proportion of similarities identified using the actual transcriptome-wide gene-expression profiles that are also identified using the inferred transcriptome-wide gene-expression profiles is then compared across the different numbers of cluster centroid landmark transcript measurements.

First, a matrix of enrichment scores was constructed by executing 184 independent query signatures obtained from Lamb et al. and the Molecular Signatures Database (MSigDB; release 1.5; broadinstitute.org/gsea/msigdb) against the full Connectivity Map dataset, as described (Lamb et al.), producing a ‘reference connectivity matrix’ (i.e. 184 queries × 1,309 treatments).

The 7,056 transcriptome-wide gene-expression profiles were downloaded from the Connectivity Map website in the form of .cel files. The .cel files were then preprocessed to produce average-difference values (i.e. expression levels) for each probe set using MAS5 (Affymetrix). Expression levels for each set of centroid probe sets were extracted, and 9 sets of 7,056 transcriptome-wide gene-expression profiles were created using the corresponding dependency matrices; the expression level of each non-centroid probe set was computed by multiplying the expression level of each centroid probe set by its dependency-matrix factor and summing the products. Rank-ordered lists of probe sets were computed for each treatment-and-vehicle pair using these (inferred) transcriptome-wide gene-expression profiles as described (Lamb et al.). Matrices of enrichment scores were created for each of the 9 datasets with the set of 184 query signatures exactly as was done to create the reference connectivity matrix.

The number of query signatures for which the treatment with the highest enrichment score in the reference connectivity matrix was also the top scoring treatment in the connectivity matrix produced from each of the 9 inferred datasets was plotted (FIG. 2). The dataset generated using expression values for only 1,000 centroid probe sets identified the same treatment as the dataset generated using expression values for all 22,283 probe sets in 147 of 184 (80%) cases. These findings indicate that 1,000 cluster centroid landmark transcripts may be used to create useful transcriptome-wide gene-expression profiles.
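The comparison of top-scoring treatments may be illustrated with the short sketch below, in which each connectivity matrix is a queries-by-treatments array of enrichment scores. The function name `top_hit_agreement` and the small example matrices are hypothetical; only the final line uses figures reported above.

```python
import numpy as np

def top_hit_agreement(reference, inferred):
    """Count queries whose top-scoring treatment is the same in both matrices.

    reference, inferred: arrays of shape (queries, treatments).
    """
    ref_top = np.argmax(np.asarray(reference), axis=1)
    inf_top = np.argmax(np.asarray(inferred), axis=1)
    return int((ref_top == inf_top).sum())

# Hypothetical scores for 4 queries against 3 treatments.
reference = np.array([[0.9, 0.1, 0.3],
                      [0.2, 0.8, 0.5],
                      [0.4, 0.3, 0.7],
                      [0.6, 0.2, 0.1]])
inferred = np.array([[0.8, 0.2, 0.1],
                     [0.1, 0.4, 0.6],
                     [0.3, 0.2, 0.9],
                     [0.7, 0.3, 0.2]])
n_same = top_hit_agreement(reference, inferred)
print(n_same, "of", reference.shape[0])   # 3 of 4

# The proportion reported above for 1,000 centroid probe sets:
print(round(100 * 147 / 184))             # 80
```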

Example III Platform-Specific Selection of Cluster Centroid Landmark Transcripts

This example describes one method for validating the performance of cluster centroid landmark transcripts on a selected moderate-multiplex assay platform. This example relates specifically to the measurement of expression levels of cluster centroid landmark transcripts derived from gene-expression profiles generated using Affymetrix microarrays using the LMF method of Peck et al., “A method for high-throughput gene expression signature analysis” Genome Biology 7:R61 (2006). See FIG. 3.

Probe pairs were designed for 1,000 cluster centroid landmark transcripts selected according to Example I (above) as described by Peck et al. The expression levels of these transcripts were measured by LMF in a collection of 384 biological samples which may comprise unperturbed cell lines, cell lines treated with bioactive small molecules, and tissue specimens, and for which transcriptome-wide gene-expression profiles generated using Affymetrix microarrays were available. A plot of normalized expression level measured by LMF against normalized expression level measured by Affymetrix microarray for a representative cluster centroid landmark transcript (217995_at:SQRDL) across all 384 biological samples is shown as FIG. 4. Vectors of expression levels across all 384 samples were constructed for every feature from both measurement platforms.

For each cluster centroid landmark transcript, the corresponding LMF vector was used as the index in a nearest-neighbors analysis to rank the Affymetrix probe sets. Cluster centroid landmark transcripts were considered to be ‘validated’ for measurement by LMF when the Affymetrix probe set mapping to that cluster centroid landmark transcript had a rank of 5 or greater, and the Affymetrix probe sets mapping to 80% or more of the non-centroid transcripts in the corresponding cluster had a rank of 100 or greater.

Not all attempts to create validated cluster centroid landmark transcripts were successful. Transcripts failing to meet the validation criteria were found to be of two types: (1) simple, where the measurements of the centroid transcript itself were poorly correlated across the 384 samples; and (2) complex, where the measurements of the centroid transcripts were well correlated but those levels were not well correlated with those of the non-centroid transcripts from its cluster. Neither type of failure could be anticipated. A plot of normalized expression levels determined by LMF and Affymetrix microarray for three validated transcripts (218039_at:NUSAP1, 201145_at:HAX1, 217874_at:SUCLG1), one representative type-1 failure (202209_at:LSM3), and one representative type-2 failure (217762_at:RAB31) in one of the 384 biological samples is presented as FIG. 5. A plot of normalized expression levels determined by LMF and Affymetrix microarray for one of these validated transcripts and the same representative type-2 failure in a different one of the 384 biological samples is presented as FIG. 6A. FIG. 6B shows the expression levels of the same transcripts in the same biological sample together with those of three transcripts from their clusters (measured using Affymetrix microarray only). Only the expression level of the validated transcript (218039_at:NUSAP1) is correlated with the levels of the transcripts in its cluster (35685_at:RING1, 36004_at:IKBKG, 41160_at:MBD3). The expression level of the type-2 failed transcript (217762_at:RAB31) is not correlated with the levels of all of the transcripts in its cluster (48612_at:N4BP1, 57516_at:ZNF764, 57539_at:ZGPAT). A representative list of transcripts exhibiting simple (type 1) failures, together with the gene-specific portions of their LMF probe pairs, is provided as Table 1. A representative list of transcripts exhibiting complex (type 2) failures, together with the gene-specific portions of their LMF probe pairs, is provided as Table 2.

The use of alternative probe pairs allowed a proportion of failed cluster centroid landmark transcripts to be validated. When this was not successful, failed cluster centroid landmark transcripts were substituted with other transcripts from the same cluster. This process was continued until validated cluster centroid landmark transcripts for all 1,000 clusters were obtained. The list of these landmark transcripts, together with the gene-specific portions of their corresponding LMF probe pairs, is provided in Table 3. A dependency matrix specific for this set of validated landmark transcripts was created according to Example I (above).

Example IV Generation and Use of Transcriptome-Wide Gene-Expression Profiles Made by Measurement of 1,000 Transcripts

This example describes one method for the generation of transcriptome-wide gene-expression profiles using measurement of the expression levels of a sub-transcriptome number of cluster centroid landmark transcripts. The present method uses the LMF moderate-multiplex gene-expression analysis platform described by Peck et al. (“A method for high-throughput gene expression signature analysis” Genome Biology 7:R61 2006), the Luminex FlexMAP 3D optically-addressed microspheres and flow-cytometric detection system, 1,000 cluster centroid landmark transcripts (and corresponding gene-specific sequences) validated for LMF from Example III (above), a corresponding dependency matrix from Example III (above), 50 empirically-determined invariant transcripts with expression levels spanning the range of those observed, and 1,050 barcode sequences developed. The FlexMAP 3D system allows simultaneous quantification of 500 distinct analytes in samples arrayed in the wells of a 384-well plate. Measurement of the expression levels of 1,000 landmark transcripts plus 50 invariant transcripts was therefore divided over 3 wells. Four hundred landmark transcripts were assayed in one well, and three hundred landmark transcripts were assayed in each of 2 additional wells. The 50 invariant genes were assayed in all 3 wells. This overall method, referred to herein as L1000, was then used to generate a total of 1,152 transcriptome-wide gene-expression profiles from cultured human cells treated with each of 137 distinct bioactive small molecules. These data were used to create an analog of a small portion of Connectivity Map de novo, and the relative performance of the L1000 version was compared to that of the original.
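The division of analytes over the three detection wells can be checked with simple arithmetic, as in the sketch below. The well labels and variable names are hypothetical; the counts are those stated above.

```python
# Capacity of one detection well and the landmark counts assigned to each well.
CAPACITY = 500
landmarks_per_well = {"well_1": 400, "well_2": 300, "well_3": 300}
N_INVARIANT = 50  # the invariant transcripts are assayed in every well

# Each well carries its landmark transcripts plus all invariant transcripts.
analytes_per_well = {w: n + N_INVARIANT for w, n in landmarks_per_well.items()}
total_landmarks = sum(landmarks_per_well.values())
total_barcodes = total_landmarks + N_INVARIANT

print(analytes_per_well)   # {'well_1': 450, 'well_2': 350, 'well_3': 350}
print(total_landmarks)     # 1000
print(total_barcodes)      # 1050
print(all(n <= CAPACITY for n in analytes_per_well.values()))  # True
```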

LMF probe pairs were constructed for each of the 1,000 landmark and 50 invariant transcripts such that each pair incorporated one of the 1,050 barcode sequences. Probes were mixed in equimolar amounts to form a probe-pair pool. Capture probes complementary to each of the barcode sequences were obtained and coupled to one of 500 homogeneous populations of optically-distinguishable microspheres using standard procedures. Three pools of capture-probe expressing microspheres were created: one pool contained beads coupled to capture probes complementary to the barcodes in 400 of the landmark probe pairs, a second pool contained beads matching a different 300 landmark probes, and a third pool contained beads matching the remaining 300 landmark probes. Each pool contained beads expressing barcodes matching the probe pairs corresponding to the 50 invariant transcripts.

MCF7 cells were treated with small molecules and corresponding vehicles in 384-well plates. Cells were lysed, mRNA captured, first-strand cDNA synthesized, and ligation-mediated amplification performed using the 1,000 landmark plus 50 invariant transcript probe-pair pool in accordance with the published LMF method (Peck et al.). The amplicon pools obtained after the PCR step were divided between 3 wells of fresh 384-well plates, and each hybridized to one of the three bead pools at a bead density of approximately 500 beads of each address per well, also in accordance with the published LMF method. The captured amplicons were labeled with phycoerythrin and the resulting microsphere populations were analyzed using a FlexMAP 3D instrument in accordance with the manufacturer's instructions.

Median fluorescence intensity (MFI) values from each microsphere population from each detection well were associated with their corresponding transcript and sample. MFI values for each landmark transcript were scaled relative to those for the set of invariant transcripts obtained from the same detection well, and all scaled MFI values derived from the same samples were concatenated to produce a list of normalized expression levels for each of the 1,000 landmark transcripts in each treatment sample.

Predicted expression levels for transcripts that were not measured were calculated by multiplying the expression levels of each of the landmark transcripts by the weights contained in the dependency matrix and summing the products. Computed and measured expression levels were combined to create full-transcriptome gene-expression profiles for each sample. Rank-ordered lists of transcripts were computed for each pair of treatment and corresponding vehicle-control profiles as described by Lamb et al. (“The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease” Science 313:1929-1935 2006), resulting in an analog of the Connectivity Map dataset containing a total of 782 small-molecule treatment instances.
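The inference step may be illustrated as a single matrix product followed by merging of measured and computed values, as in the sketch below. The function name `infer_full_profile`, the index arrays, and the toy weights are assumptions of this illustration and do not correspond to actual landmark transcripts or dependency-matrix values.

```python
import numpy as np

def infer_full_profile(landmark_levels, W, landmark_idx, other_idx):
    """Combine measured landmark levels with levels inferred for all other transcripts.

    landmark_levels: (g,) measured, normalized landmark expression levels.
    W: (g, m) dependency matrix; column j holds the weights for inferred transcript j.
    landmark_idx, other_idx: positions of measured and inferred transcripts
    in the full profile.
    """
    landmark_levels = np.asarray(landmark_levels, dtype=float)
    profile = np.empty(len(landmark_idx) + len(other_idx))
    profile[landmark_idx] = landmark_levels
    # Weighted sum over landmarks for every unmeasured transcript.
    profile[other_idx] = landmark_levels @ np.asarray(W, dtype=float)
    return profile

# Toy example: 2 landmarks, 3 inferred transcripts.
levels = [2.0, 4.0]
W = [[0.5, 1.0, 0.0],
     [0.25, 0.0, 2.0]]
profile = infer_full_profile(levels, W, landmark_idx=[0, 3], other_idx=[1, 2, 4])
print(profile.tolist())  # [2.0, 2.0, 2.0, 4.0, 8.0]
```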

Enrichment scores for each of the perturbagens in the original Connectivity Map (created with Affymetrix microarrays) and the L1000 analog were computed according to the method of Lamb et al. for a published query signature derived from an independent transcriptome-wide gene-expression analysis of the effects of three biochemically-verified histone-deacetylase (HDAC) inhibitor compounds (Glaser et al., “Gene expression profiling of multiple histone deacetylase (HDAC) inhibitors: defining a common gene set produced by HDAC inhibition in T24 and MDA carcinoma cell lines” Mol Cancer Ther 2:151-163 2003). As anticipated, the small molecule with the highest score in the original Affymetrix Connectivity Map was vorinostat, an established HDAC inhibitor (enrichment score=0.973, n=12, p-value<0.001). Notably, vorinostat was also the highest scoring perturbagen in the L1000 dataset (score=0.921, n=8, p-value<0.001). See FIG. 7. An additional 95 query signatures were executed against both datasets. The perturbagen with the highest score in the original Connectivity Map also had the highest score of those in the L1000 dataset in 79 (83%) of those cases.

These data show that L1000 may substitute for a technology that directly measures the expression levels of all transcripts in the transcriptome (specifically, Affymetrix high-density oligonucleotide microarrays) in one useful application of transcriptome-wide gene-expression profiling.

Example V Use of Transcriptome-Wide Gene-Expression Profiles Made by Measurement of 1,000 Transcripts for Clustering of Cell Lines

Transcriptome-wide gene-expression profiles were generated from total RNA isolated from 44 cultured human cancer cell lines derived from six tissue types using measurement of the expression levels of a sub-transcriptome number of cluster centroid transcripts and inference of the remaining transcripts according to the L1000 methods described in Example IV. Full-transcriptome gene-expression data were produced from these same total RNA samples using Affymetrix U133 Plus 2.0 high-density oligonucleotide microarrays for comparison.

Cell lines were grouped together according to consensus hierarchical clustering of their corresponding gene-expression profiles (Monti et al., “Consensus Clustering: A resampling-based method for class discovery and visualization of gene expression microarray data” Machine Learning Journal 52:91-118 2003). The similarity metric used was Pearson correlation. One hundred twenty-five clustering iterations were made. In each iteration, 38 (85%) of the samples were used and 6 excluded.

As anticipated, the results of the consensus clustering made with the Affymetrix data placed cell lines from the same tissue in the same branch of the dendrogram, with only a few exceptions (FIG. 8A). Many similar findings have been reported (Ross et al., “Systematic variation in gene expression patterns in human cancer cell lines” Nature Genetics 24:227-235 2000). Remarkably, clustering of the L1000 data also placed cell lines with the same tissues of origin in the same branch of the dendrogram (FIG. 8B).

This example shows that L1000 may substitute for a technology that directly measures the expression levels of all transcripts in the transcriptome (specifically, Affymetrix high-density oligonucleotide microarrays) in a second useful application of transcriptome-wide gene-expression profiling; that is, grouping of samples on the basis of biological similarity.

Example VI Use of Transcriptome-Wide Gene-Expression Profiles Made by Measurement of 1,000 Transcripts for Gene-Set Enrichment Analysis

The expression levels of 1,000 cluster centroid transcripts were measured in primary human macrophages following treatment with lipopolysaccharide (LPS) or vehicle control, and used to create gene-expression profiles composed of expression levels for 22,268 transcripts, according to the L1000 methods described in Example IV. These data were used as input for a Gene-Set Enrichment Analysis (GSEA) with a library of 512 gene sets from version 3 of the Molecular Signatures Database (Subramanian et al., “Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles” Proc Natl Acad Sci 102:15545-15550 2005).

LPS is known to be a potent activator of the NF-κB transcription-factor complex (Qin et al., “LPS induces CD40 gene expression through the activation of NF-κB and STAT-1α in macrophages and microglia” Blood 106:3114-3122 2005). It was therefore not unexpected that a gene set composed of 23 members of the canonical NF-κB signaling pathway (BIOCARTA_NFKB_PATHWAY) received the highest score of all gene sets tested (p<0.001). This example shows that L1000 may generate data compatible with a third useful application of full-transcriptome gene-expression profiling; that is, gene-set enrichment analysis. However, closer examination of the analysis revealed that none of the 23 transcripts in the BIOCARTA_NFKB_PATHWAY gene set had been explicitly measured. This example then also demonstrates the utility of the method even in the extreme case when the expression levels of all of the transcripts of interest were inferred.

Example VII Creation of a Full-Transcriptome Gene-Expression Dataset of Unprecedented Size

The L1000 methods described in Example IV were used to create a connectivity map with more than 100,000 full-transcriptome gene-expression profiles from a panel of cultured human cells treated with a diversity of chemical and genetic perturbations at a range of doses and treatment durations.

Creation of a dataset of this size is impractical with existing transcriptome-wide gene-expression profiling technologies (e.g. Affymetrix GeneChip) due to high cost and low throughput. This example therefore demonstrates the transformative effect of the present invention on the field of gene-expression profiling in general, and its potential to impact medically-relevant problems in particular.

TABLE 1 Representative Type I (simple) Landmark Transcript/Probe-Pair Failures

##  name      alternate name        left probe sequence   right probe sequence
 1  FFA6B6    200058_s_at:SNRNP200  CCATCAAGAGGCTGACCTTG  CAGCAGAAGGCCAAGGTGAA
 2  RE1F1     200064_at:HSP90AB1    GGCGATGAGGATGCGTCTCG  CATGGAAGAAGTCGATTAGG
 3  YC7D7     200729_s_at:ACTR2     GAAAATCCTATTTATGAATC  CTGTCGGTATTCCTTGGTAT
 4  GGG6H6    200792_at:XRCC6       TGCTGGAAGCCCTCACCAAG  CACTTCCAGGACTGACCAGA
 5  CC1D1     200870_at:STRAP       GTGTCAGATGAAGGGAGGTG  GAGTTATCCTCTTATAGTAC
 6  AG12H12   200991_s_at:SNX17     TTCTCTTGGCCAGGGGCCTC  GTATCCTACCTTTCCTTGTC
 7  DDC7D7    201488_x_at:KHDRBS1   TCTTGTATCTCCCAGGATTC  CTGTTGCTTTACCCACAACA
 8  BBA1B1    201511_at:AAMP        CACGTCAGGAGACCACAAAG  CGAAAGTATTTTGTGTCCAA
 9  LG12H12   201620_at:MBTPS1      CAGGGGAAGGATGTACTTTC  CAAACAAATGATACAACCCT
10  YC12D12   201652_at:COPS5       AAAGTTAGAGCAGTCAGAAG  CCCAGCTGGGACGAGGGAGT
11  FFE11F11  201683_x_at:TOX4      AATGACAGACATGACATCTG  GCTTGATGGGGCATAGCCAG
12  FFG11H11  201684_s_at:TOX4      TTATCTGCTGGGAAAGTGTC  CAAGAGCCTGTTTITGAAAC
13  OG3H3     201696_at:SFRS4       TAACCTGGACGGCTCTAAGG  CTGGAATGACCACATAGGTA
14  YA1B1     201710_at:MYBL2       ATGTTTACAGGGGTTGTGGG  GGCAGAGGGGGTCTGTGAAT
15  VC3D3     201729_s_at:KIAA0100  GGCAGGCGCAAATGATTTGG  CGATTCGAGTGGCTGCAGTA
16  AAC9D9    201773_at:ADNP        ACTTAGTTTTTGCACATAAC  CTTGTACAATCTTGCAACAG
17  BBA7B7    201949_x_at:CAPZB     AGCTCTGGGAGCAGAGGTGG  CCCTCGGTGCCGTCCTGCGC
18  CCE4F4    202116_at:DPF2        TTGTTCTTCCTGGACCTGGG  CATTCAGCCTCCTGCTCTTA
19  ME8F8     202123_s_at:ABL1      CGACTGCCTGTCTCCATGAG  GTACTGGTCCCTTCCTTTTG
20  UUA11B11  202178_at:PRKCZ       CACGGAAACAGAACTCGATG  CACTGACCTGCTCCGCCAGG
21  MA1B1     202261_at:VPS72       TGTTCCGTTTCTTCTCCCTG  CTTCTCCCCTTTGTCATCTC
22  RG1H1     202298_at:NDUFA1      GCTCATTTTGGGTATCACTG  GAGTCTGATGGAAAGAGATA
23  OE2F2     202408_s_at:PRPF31    CCGCCCAGTATGGGCTAGAG  CAGGTCTTCATCATGCCTTG
24  LC9D9     202452_at:ZER1        CCTGGGGAGCAGCGCTAACC  CTGGAGGCAGCCTTTGGGTG
25  ZC12D12   202477_s_at:TUBGCP2   ACACGGAGCGCCTGGAGCGC  CTGTCTGCAGAGAGGAGCCA
26  UUE8F8    202717_s_at:CDC16     ACTCTGCTATTGGATATATC  CACAGTCTGATGGGCAACTT
27  VA5B5     202757_at:COBRA1      ACGGGGCCAGCTGGACACAC  GGTGAGATTTTCTCGTATGT
28  EEE4F4    203118_at:PCSK7       CCTGTCTTCCTCTGCAAGTG  CTCAGGGAAATGGCCTTCCC
29  AAA12B12  203154_s_at:PAK4      TCATTTTATAACACTCTAGC  CCCTGCCCTTATTGGGGGAC
30  LE8F8     203190_at:NDUFS8      CCACGGAGACCCATGAGGAG  CTGCTGTACAACAAGGAGAA
31  ZC9D9     203201_at:PMM2        GGAAGGATCCCGGGTCTCAG  CTAGAACACGGTGGAAGAGA
32  BE3F3     203517_at:MTX2        TCTGTAGGAGAATTGAACAG  CACTATTTTGAAGATCGTGG
33  TE8F8     203530_s_at:STX4      CATCACCGTCGTCCTCCTAG  CAGTCATCATTGGCGTCACA
34  FFC9D9    203572_s_at:TAF6      CCTCTGGTCCTGGGAGTGTC  CAGAAGTACATCGTGGTCTC
35  MC4D4     204549_at:IKBKE       AGGGCAGTAGGTCAAACGAC  CTCATCACAGTCTTCCTTCC
36  UC11D11   204757_s_at:C2CD2L    GCCTCTGAGAATGTTGGCAG  CTCACAGAGAGCAGGGCCGG
37  FFE1F1    206050_s_at:RNH1      GTCCTGTACGACATTTACTG  GTCTGAGGAGATGGAGGACC
38  AAA1B1    206075_s_at:CSNK2A1   CTCCCAGGCTCCTTACCTTG  GTCTTTTCCCTGTTCATCTC
39  SG10H10   207988_s_at:ARPC2     TAAGAGGAGGAAGCGGCTGG  CAACTGAAGGCTGGAACACT
40  AE8F8     208093_s_at:NDEL1     GCATGTTAATGACTCTGATG  GTGTCCTCCTCTGGGCAGCT
41  CG1H1     208152_s_at:DDX21     GGAAGTTAAGGTTTCCTCAG  CCACCTGCCGAACAGTTTCT
42  GGG9H9    208174_x_at:ZRSR2     TCGGGAGAGGCACAATTCAC  GAAGCAGAGGAAGAAATAGG
43  EEA12B12  208720_s_at:RBM39     GATGGGATACCGAGATTAAG  GATGATGTGATTGAAGAATG
44  BA10B10   208887_at:EIF3G       GCTAAGGACAAGACCACTGG  CCAATCCAAGGGCTTCGCCT
45  EEA6B6    208996_s_at:POLR2C    CCAGTGCACCTGTAGGGAAC  CAACTAGACTTCTCTCCTGG
46  JE11F11   209044_x_at:SF3B4     TCCCCCTCACTACCTTCCTC  CTGTACAACTTTGCTGACCT
47  SE12F12   209659_s_at:CDC16     AAACGGGGCTTACGCCATTG  GAAACCTCAAGGAAAACTCC
48  IIA3B3    210947_s_at:MSH3      TGGAATTGCCATTGCCTATG  CTACACTTGAGTATTTCATC
49  YYA10B10  211233_x_at:ESR1      CTGCTGGCTACATCATCTCG  GTTCCGCATGATGAATCTGC
50  FFC1D1    212047_s_at:RNF167    GTGACCTATTTGCACAGACC  GTCGTCTTCCCTCCAGTCTT
51  TTC2D2    212087_s_at:ERAL1     CACAGGAGGCAGGCCATGAC  CTCATGGACATCTTCCTCTG
52  UUA10B10  212216_at:PREPL       CCTGAAATTCTGAAACACTG  CATTCAACTGGGAATTGGAA
53  OA4B4     212544_at:ZNHIT3      AGGTCATGCAGGCCTTTACC  GGCATTGATGTGGCTCATGT
54  DDG6H6    212564_at:KCTD2       ACGCAGGTGATGCCAGCCAG  GCCCAGGAGTGCCCAGCATC
55  IIE7F7    212822_at:HEG1        GCGGATGAACTGACATGCTC  CTACCATGACCAGGCTCTGG
56  ZG12H12   212872_s_at:MED20     AAGCCTCTGCAACAAGTCAG  GTGGTGGTCATGTTTCCCTT
57  NC5D5     212968_at:RFNG        ACCACAGAGATGTTTTCTCC  GCTCTGACTTGTGGCTCAGG
58  GGA5B5    214004_s_at:VGLL4     GCCAAAGCTCTGGGTGACAC  GTGGCTCCAGATCAAAGCGG
59  AAC1D1    216525_x_at:PMS2L3    TTTCTACCTGCCACGCGTCG  GTGAAGGTTGGGACTCGACT
60  FFA9B9    217832_at:SYNCRIP     TATATCACATACCCAATAGG  CACCACGATGAAGATCAGAG
61  BG1H1     217987_at:ASNSD1      TTTTACGCCTTGCAGCTGTG  GAACTTGGTCTTACAGCCTC
62  UUC9D9    218114_at:GGA1        TGGGGCACCTAGAGTTCTCG  GTGTGTCTCCTTCATTCATT
63  LE4F4     218386_x_at:USP16     CAGCGACACACATGTGCAAG  CTGTGCCTACAACTAAAGTA
64  FFE3F3    218649_x_at:SDCCAG1   GAAACTGAACAGTGAAGTGG  CTTGATTGCTTAAACTATTG
65  NG4H4     218725_at:SLC25A22    CTGGCCATGTGATCGTGTTG  GTGACAGACCCTGATGTGCT
66  BBE10F10  218760_at:COQ6        GGCTTTGGGGATATCTCCAG  CTTGGCCCATCACCTCAGTA
67  BE11F11   202209_at:LSM3        GCCCCTCCACTGAGAGTTGG  CTGAAACAAAGAATTTGTCC

TABLE 2 Representative Type II (complex) Landmark Transcript/Probe-Pair Failures

##  name      alternate name        left probe sequence   right probe sequence
 1  AA3B3     221049_s_at:POLL      ATTTTAAGCAGGAGCAGGTG  GCTGGTTTGAAGCCCCAGGT
 2  AAG3H3    41160_at:MBD3         GCTCCCTGTCAGAGTCAAAG  CACAAATCCTCAGGACGGGC
 3  AC6D6     218912_at:GCC1        TTTCTGCCCAGTGGGTCTTG  GCATAAGTAGATTAATCCTG
 4  AE7F7     221560_at:MARK4       GAGTTAAAGAAGAGGCGTGG  GAATCCAGGCAGTGGTTTTT
 5  AG4H4     219445_at:GLTSCR1     AACAAGAAACTGGGGTCTTC  CTCTCCCCCGAACCTCTCCC
 6  CA6B6     218936_s_at:CCDC59    GCCTCTGAAGGAAGGTTGGC  CTGAAGAACTGAAAGAACCT
 7  FFA4B4    221471_at:SERINC3     CTTCCCTAGAAGAATGGTTG  CTGATATGGCTACTGCTTCT
 8  GGA1B1    221490_at:UBAP1       GGTTCTGCAATATCTCTGAG  GTGCAAAGAATGCACTTTTC
 9  HHG1H1    222039_at:KIF18B      TGAAGATGTGGATGATAATG  GTGCCTTGATTTCCAAATGA
10  VG10H10   217762_s_at:RAB31     GAACAATCAAAGTTGAGAAG  CCAACCATGCAAGCCAGCCG
11  NA5B5     221196_x_at:BRCC3     GTTGCCAGGGATAGGGACTG  GAGGGGGTGTGGGGTATGTA
12  RRE10F10  222351_at:PPP2R1B     AGAGGACATGGGGAAGGGAC  CAGTGTATCAGTTGCGTGGA
13  SSE6F6    220079_s_at:USP48     AGATGCGTTGGTCCATAAAG  GATTGTATCAAGTAGATGGG
14  TA5B5     221567_at:NOL3        GTGAGACTAGAAGAGGGGAG  CAGAAAGGGACCTTGAGTAG
15  UUG11H11  221858_at:TBC1D12     ATGGGTCATTCTAGTCTAAG  GACTACTAGTAGAACCCTCA
16  WC8D8     90610_at:LRCH4        AAGACGCGCCTGGGCTCCGC  GCTCTCAGAGAAGCACGTGG
17  XE6F6     222199_s_at:BIN3      ACGACTGAGCCCTGCTTCTG  CTGGGGCTGTGTACAGAGTG
18  YG1H1     221856_s_at:FAM63A    CTAGGATTGGTGGGTTTCTG  GTTCTCAACTCCCGGTCCCT

TABLE 3 Representative Cluster Centroid LandmarkTranscripts/Probe Pairs Validated for LMF gene ## name Affymetrix symbolleft probe sequence right probe sequence    1 QC7D7 209083_at CORO1ACCCTCCTCATCTCCCTCAAG GATGGCTACGTACCCCCAAA    2 AAAG5H5 221223_x_at CISHTGTGTCTCACCCCCTCACAG GACAGAGCTGTATCTGCATA    3 TE6F6 203458_at SPRGGAAAGAGTGATCTGGTGTC GAATAGGAGGACCCATGTAG    4 MME12F12 203217_s_atST3GAL5 AACTGTGAAGCCACCCTGGG CTACAGAAACCACAGTCTTC    5 LLLC12D12202862_at FAH TCCATGTTGGAACTGTCGTG GAAGGGAACGAAGCCCATAG    6 IIC3D3201393_s_at IGF2R AGAAGCAAACCGCCCTGCAG CATCCCTCAGCCTGTACCGG    7 PPE8F8203233_at IL4R CGGGCAATCCAGACAGCAGG CATAAGGCACCAGTTACCCT    8 MMMA8B8209531_at GSTZ1 TAGGGAGATGCGGGGAGCAG GGTGGGCAGGAATACTGTTA    9 BBE6F6218462_at BXDC5 ATCCTCAATTTATCGGAAGG CAGGTTGCCACATTCCACAA   10 IIG7H7213417_at TBX2 TAGACCGCGTGATAAAACTG GGTTGAGGGATGCTGGAACC   11 NNA11B11201795_at LBR TGGTGGCGTTTTCTGTACTG GATTGCACCAAGGAAGCTTT   12 XG1H1204752_x_at PARP2 TGGGAGTACAGTGCCATTAG GACCAGCAAGTGACACAGGA   13 YA8B8200713_s_at MAPRE1 CTTTGTTTGGCAGGATTCTG CAAAATGTGTCTCACCCACT   14MMME2F2 203138_at HAT1 AGCTGGAAGAGAGTTTTCAG GAACTAGTGGAAGATTACCG   15NG5H5 209515_s_at RAB27A ACTGTACTTGCTGGGTCTTG CCAAGATCATTTATTCCGCT   16SSG2H2 211605_s_at RARA CTCTCATCCAGGAAATGTTG GAGAACTCAGAGGGCCTGGA   17PG4H4 201078_at TM9SF2 TTACCAAAATATACAGTGTG GTGAAGGTTGACTGAAGAAG   18TE2F2 202401_s_at SRF GGTGATATTTTTATGTGCAG CGACCCTTGGTGTTTCCCTT   19ZZE5F5 203787_at SSBP2 GCTCCTGCCCCCTCCCTGAA CTATTTTGTGCTGTGTATAT   20MMG1H1 200972_at TSPAN3 GACTGATGCCGAAATGTCAC CAGGTCCTTTCAGTCTTCAC   21XXG10H10 217766_s_at TMEM50A AAAAGCATGATTCCCACAAG GACTAAGTATCAGTGATTTG  22 MC1D1 212166_at XPO7 GTGGATATTTATATATGTAC CCTGCACTCATGAATGTATG   23JJG3H3 204812_at ZW10 GGCCCTAGCTTTGGAACGAG GAATTGGGAGATTCCAGGAG   24ZZE7F7 218489_s_at ALAD CTGATGGCACATGGACTTGG CAACAGGGTATCGGTGATGA   25NA4B4 201739_at SGK1 TAGTATATTTAAACTTACAG GCTTATTTGTAATGTAAACC   26IIIA7B7 206770_s_at SLC35A3 CAAGACTGCTGAAAGCAATC CAGTTGCTCCTGTGCTAGAT  27 QQC6D6 205774_at F12 
GATTCCGCAGTGAGAGAGTG GCTGGGGCATGGAAGGCAAG
28 NNE10F10 201611_s_at ICMT GCCTTAGGTAGTTGGGCTTG CCCACCCTAGTTTGCTTTTG
29 VA3B3 209092_s_at GLOD4 ATGAGTGTGTGACGTTGCTG CACGCCTGACTCTGTGCGAG
30 LLA1B1 219382_at SERTAD3 GAAAGCTGGGCCTGTCGAAG GATGACAGGGATGTGCTGCC
31 NNE9F9 217872_at PIH1D1 AAGCCTCACCTGAACCTGTG GCTGGAAGCCCCCGACCTCC
32 KKE12F12 207196_s_at TNIP1 CACAGTAGCCTTGCTGAAGC CATCACAGATGGGAGAAGGC
33 NG12H12 202417_at KEAP TACATAGAAGCCACCGGATG GCACTTCCCCACCGGATGGA
34 XG8H8 203630_s_at COG5 TTCACTAAATAAGCATGTAG CTCAGTGGTTTCCAAATTTG
35 OOA7B7 219952_s_at MCOLN1 ATTCGACCTGACTGCCGTTG GACCGTAGGCCCTGGACTGC
36 PPA9B9 203291_at CNOT4 ACGAGGGCACTCTGAGATAG CACTGCTCTGGGGCCATCTG
37 HHHA5B5 217789_at SNX6 GCAGGTTTGCTTGACCTCTG CCTCAGTTCTCGACTCTAAA
38 LLA7B7 203117_s_at PAN2 AGCAAGTAGAGTGTTGGTGG CCCAAGCAAACCAGTGTTGC
39 QG3H3 202673_at DPM1 GATGGAGATGATTGTTCGGG CAAGACAGTTGAATTATACT
40 MC11D11 203373_at SOCS2 AAAAACCAATGTAGGTATAG GCATTCTACCCTTTGAAATA
41 VVA2B2 217719_at EIF3L TTATGGGGATTTCTTCATCC GTCAGATCCACAAATTTGAG
42 FFFC6D6 210695_s_at WWOX CTGCTTGGTGTGTAGGTTCC GTATCTCCCTGGAGAAGCAC
43 MMG8H8 201829_at NET1 GTGTAGTAAGTTGTAGAAGG CTCGAGGGGACGTGGACTTA
44 JJJE10F10 203379_at RPS6KA1 CACACACCTCCGAGACAGTC CAGTGTCACCTCTCTCAGAG
45 TTC4D4 204757_s_at C2CD2L AGACCAGCACCAGTGTCTGC CTCTGAGAATGTTGGCAGCT
46 HHC11D11 203725_at GADD45A TCAACTACATGTTCTGGGGG CCCGGAGATAGATGACTTTG
47 LLE12F12 202466_at POLS GGGTGTGCATTTTAAAACTC GATTCATAGACACAGGTACC
48 IIE1F1 212124_at ZMIZ1 CATAAACACACCCACCAGTG CAGCCTGAAGTAACTCCCAC
49 HHG8H8 200816_s_at PAFAH1B1 AAGCTGGATTTACAGGTCAC GGCTGGACTGAATGGGCCTT
50 JJJA2B2 202635_s_at POLR2K AATCAGATGCAGAGAATGTG GATACAGAATAATGTACAAG
51 JJJA10B10 203186_s_at S100A4 TGGACAGCAACAGGGACAAC GAGGTGGACTTCCAAGAGTA
52 IIA5B5 207163_s_at AKT1 TAGCACTTGACCTTTTCGAC GCTTAACCTTTCCGCTGTCG
53 RA9B9 218346_s_at SESN1 CAGCACCAAAGTTGTGGGAC ATGTTGCTGTAGACTGCTGC
54 NA8B8 201896_s_at PSRC1 GAATTTTATCTTCTTCCTTG GCATTGGTTCACTGGACATT
55 MME3F3 203013_at ECD GACCAGGAACTAGCACACAC CTGCATCAGCAAAAGTTTCA
56 IIIE12F12 207620_s_at CASK AAAAGCCTCTTTGTTATCGG CCTTGTGTCAGCAGGTCATG
57 ZE4F4 201980_s_at RSU1 CAACACTTCATTCTCTCTTG CCCTGTCTCTCAAATAAACC
58 OE6F6 204825_at MELK GCTGCAAGGTATAATTGATG GATTCTTCCATCCTGCCGGA
59 ZZA12B12 201170_s_at BHLHE40 ACTTGTTTTCCCGATGTGTC CAGCCAGCTCCGCAGCAGCT
60 ZZE11F11 211715_s_at BDH1 CTGCGAATGCAGATCATGAC CCACTTGCCTGGAGCCATCT
61 NNG3H3 208078_s_at SIK1 TTGGGGCAGCCAGGCCCTTG CCTTCATTTTTACAGAGGTA
62 QC3D3 203338_at PPP2R5E CGTTCTATATCTCATCACAG CGCCAGCCCTGTTTTTAGCC
63 MMMG11H11 217956_s_at ENOPH1 ACAGCAAGCAGTTGCCTTAC CAGTGAAAAAGGTGCACTGA
64 JJJA9B9 202095_s_at BIRC5 CCAACCTTCACATCTGTCAC GTTCTCCACACGGGGGAGAG
65 MMME3F3 216836_s_at ERBB2 TCCCTGAAACCTAGTACTGC CCCCCATGAGGAAGGAACAG
66 LLLE10F10 212694_s_at PCCB TCCACACGTGCCCGAATCTG CTGTGACCTGGATGTCTTGG
67 ZZC6D6 204497_at ADCY9 TGAGAGCCCCACAGGCTCTG CCACACCCGTGACTTCATCC
68 UUC1D1 221142_s_at PECR GTGTCCTCCATCCCCCAGTG CCTTCACATCTTGAGGATAT
69 RE10F10 203246_s_at TUSC4 ATCTGCTGGAAGTGAGGCTG GTAGTGACTGGATGGACACA
70 XE5F5 203071_at SEMA3B CAGGCCCTGGCTGAGGGCAG CTGCGCGGGCTTATTTATTA
71 LLLC6D6 217784_at YKT6 AGGACCCTGGGGAGAGATGG GGGCGGGGAAAATGGAGGTA
72 LE10F10 202784_s_at NNT CTATGCTGCAGTGGACAATC CAATCTTCTACAAACCTAAC
73 NNNE6F6 200887_s_at STAT1 TGTAACTGCATTGAGAACTG CATATGTTTCGCTGATATAT
74 WWC5D5 202540_s_at HMGCR GACTCTGAAAAACATTCCAG GAAACCATGGCAGCATGGAG
75 MMG6H6 220643_s_at FAIM TGGTAAAAAATTGGAGACAG CGGGTGAGTTTGTAGATGAT
76 ZG7H7 202446_s_at PLSCR1 AAATCAGGAGTGTGGTAGTG GATTAGTGAAAGTCTCCTCA
77 HHHG9H9 219888_at SPAG4 GCTGGGCTTTTGAAGGCGAC CAAGGCCAGGTGGTGATCCA
78 EEEE11F11 204653_at TFAP2A GTATTCTGTATTTTCACTGG CCATATTGGAAGCAGTTCTA
79 MME5F5 217080_s_at HOMER2 AAACAAGCTTCTGGTGGGTG CATTTTCTGGCCCGGAGTTG
80 NE9F9 212846_at RRP1B CTAAGTAAAATTGCCAAGTG GACTTGGAAGTCCAGAAAGG
81 YYA9B9 203442_x_at EML3 GCCTTGACTCCCGCTGCCTG CTGAGGGGCAATAAACCAGA
82 HHE2F2 202324_s_at ACBD3 AGCTCATAGGTGTTCATACT GTTACATCCAGAACATTTGT
83 NNNA5B5 214473_x_at PMS2L3 CATCAGAATTACTTTGAAGG CTACTATTAATATGCAGACT
84 PA1B1 203008_x_at TXNDC9 TGATGTTGAATCAACTGATG CCAGCAGAAAGCTATTTTGA
85 KKKC9D9 209526_s_at HDGFRP3 TTTCCTCTCTGTGACAGAAC CCAGGAATTAATTCCTAAAT
86 PPG5H5 202794_at INPP1 GCAGAGACGCATACCTAGAG GAACTCTAACCCCGGTGTAC
87 OA6B6 202990_at PYGL CAAAGGCCTGGAACACAATG GTACTCAAAAACATAGCTGC
88 QQC5D5 205452_at PIGB CACTTCCCATGAGATTTCTC CAGTGCCCGCCAGACCTGAC
89 UG11H11 204458_at PLA2G15 TTTTCTCTGTTGCATACATG CCTGGCATCTGTCTCCCCTT
90 QE4F4 207842_s_at CASC3 GGTGGTTGTGCCTTTTGTAG GCTGTTCCCTTTGCCTTAAA
91 QQA9B9 211071_s_at MLLT11 CTTCACACCTACTCACTTTA CAACTTTGCTCCTAACTGTG
92 PC12D12 206846_s_at HDAC6 CCCATCCTGAATATCCTTTG CAACTCCCCAAGAGTGCTTA
93 SSC3D3 201498_at USP7 TGCTGCCTTGGCAGACTTAC GATCTCAACAGTTCATACGA
94 IIIG4H4 213851_at TMEM110 GACCACCGAGTGGCAAGGTG GAAGGAAGCACAGGCACACA
95 RRG5H5 219492_at CHIC2 AGTATGTTGTCTTTCCAATG GTGCCTTGCTTGGTGCTCTC
96 PPG4H4 202703_at DUSP11 ATTCTACCTGGAGACCAGAG CTGGCCTGAAAATTACTGGT
97 ZA4B4 218145_at TRIB3 TCTAACTCAAGACTGTTCTG GAATGAGGGTCCAGGCCTGT
98 MC7D7 212255_s_at ATP2C1 CCAGGAGTGCCATATTTCAG CTACTGTATTTCCTTTTTCT
99 VE9F9 200083_at USP22 CACCACTGCAACATATAGAC CTGAGTGCTATTGTATTTTG
100 SG7H7 202630_at APPBP2 CTTCATTGTGTCAGGATGAC CTTTCATATCATTCTCACCA
101 RC2D2 201774_s_at NCAPD2 CTGTGCAGGGTATCCTGTAG GGTGACCTGGAATTCGAATT
102 AAA7B7 203279_at EDEM1 TCACAGGGCTCAGGGTTATG CTCCCGCTTGAATCTGGACG
103 RRA12B12 204225_at HDAC4 GGCTAAGATTTCACTTTAAG CAGTCGTGAACTGTGCGAGC
104 UE5F5 201671_x_at USP14 TCAGTCAGATTCTTTCCTTG GCTCAGTTGTGTTTGTATTT
105 NNNA8B8 218046_s_at MRPS16 CACCAATCGGCCGTTCTACC GCATTGTGGCTGCTCACAAC
106 HHC8D8 209263_x_at TSPAN4 CACCTACATTCCATAGTGGG CCCGTGGGGCTCCTGGTGCA
107 QE3F3 200621_at CSRP1 AGGCATGGGCTGTACCCAAG CTGATTTCTCATCTGGTCAA
108 KKA2B2 200766_at CTSD GGGGTAGAGCTGATCCAGAG CACAGATCTGTTTCGTGCAT
109 YA5B5 201985_at KIAA0196 GTGCCCTTCTGTTCCTGGAG GATTATGTTCGGTACACAAA
110 HHG5H5 203154_s_at PAK4 CCTGCAGCAAATGACTACTG CACCTGGACAGCCTCCTCTT
111 PPG1H1 202284_s_at CDKN1A CAGACATTTTAAGATGGTGG CAGTAGAGGCTATGGACAGG
112 EEEA11B11 218584_at TCTN1 TGCAGAGGCAGGCTTCAGAG CTCCACCAGCCATCAATGCC
113 VE10F10 212943_at KIAA0528 CCCCCAGGACAACAAACTGC CCTTAAGAGTCATTTCCTTG
114 ZZA5B5 204656_at SHB TCCAAAGAGATGCCTTCCAG GATGAACAAAGGCAGACCAG
115 EEEG6H6 205573_s_at SNX7 TGCTAATAATGCCCTGAAAG CAGATTGGGAGAGATGGAAA
116 OOE7F7 200670_at XBP1 AGTTTGCTTCTGTAAGCAAC GGGAACACCTGCTGAGGGGG
117 YYC10D10 201328_at ETS2 TCTGTTTACTAGCTGCGTGG CCTTGGACGGGTGGCTGACA
118 QQE9F9 212765_at CAMSAP1L1 GTTTCATGGACACTGTTGAG CAATGTACAGTGTATGGTGT
119 IIE12F12 202986_at ARNT2 GTGCAGGCACATTTCCAAGC GTAGGTGTCCCTGGCTTTTG
120 XA8B8 201997_s_at SPEN AGACTGGCTAACCCCTCTTC CTATTACCTTGATCTCTTCC
121 VA8B8 203218_at MAPK9 CATGTGACCACAAATGCTTG CTTGGACTTGCCCATCTAGC
122 UUA3B3 219281_at MSRA TTATCTGTGCTCTCTGCCCG CCAGTGCCTTACAATTTGCA
123 MME8F8 201649_at UBE2L6 CTTGCCATCCTGTTAGATTG CCAGTTCCTGGGACCAGGCC
124 MA4B4 202282_at HSD17B10 TCAATGGAGAGGTCATCCGG CTGGATGGGGCCATTCGTAT
125 UUG6H6 218794_s_at TXNL4B CTTGCTTTTGGCTCATACAG GAGAGAGGGAAGGCTGCCAG
126 AAAE9F9 202866_at DNAJB12 AGATTATAAGAACTGATGTG GCCAGAGTGCCTACCCACTG
127 LC7D7 203050_at TP53BP1 TGTCACAAGAGTGGGTGATC CAGTGCCTCATTGTTGGGGA
128 IIC12D12 200045_at ABCF1 GGTGGTGCTGTTCTTTTCTG GTGGATTTAATGCTGACTCA
129 HHC10D10 218523_at LHPP GGCACACAGGGTACTTTCTG GACCCACTGCTGGACAGACT
130 AC11D11 202535_at FADD GAGTCTCCTCTCTGAGACTG CTAAGTAGGGGCAGTGATGG
131 PE9F9 202331_at BCKDHA TCAGGGGACAGCATCTGCAG CAGTTGCTGAGGCTCCGTCA
132 IIC4D4 204087_s_at SLC5A6 AGAGCAAGCACGTTTTCCAC CTCACTGTCTCCATCCTCCA
133 HHE7F7 201555_at MCM3 TTGCATCTTCATTGCAAAAG CACTGGCTCATCCGCCCTAC
134 OOG4H4 212557_at ZNF451 AGGAGGTAGTCACTGAGCTG GACCTTAAACACATCTGCAG
135 QQC2D2 204809_at CLPX GCCCCGCCAAGCAGATGCTG CAAACAGCTAAACTGTCATA
136 PPC9D9 203301_s_at DMTF1 CGAGAGAATAGTTTGTCATC CACTTAGTGTGTTAGCTGGT
137 PPE2F2 202361_at SEC24C CCTGCTGGGACACCGCTTGG GCTTTGGTATTGACTGAGTG
138 XG12H12 202716_at PTPN1 CGAGGTGTCACCCTGCAGAG CTATGGTGAGGTGTGGATAA
139 PPE12F12 204042_at WASF3 GCACAAGGCAAGTGAGTTTG CACTGTCAGCCCCAGACCGT
140 HHE11F11 201675_at AKAP1 AGACATGAACTGACTAATTG GTATCCACTACTTGTACAGC
141 BBBE11F11 217989_at HSD17B11 TCCAATGCCAAACATTTCTG CACAGGGAAGCTAGAGGTGG
142 SSA8B8 202260_s_at STXBP1 GTCTCCCTCCCAACTTATAC GACCTGATTTCCTTAGGACG
143 AAE5F5 201225_s_at SRRM1 GAAATGAATCAGGATTCGAG CTCTAGGATGAGACAGAAAA
144 IIE11F11 202624_s_at CABIN1 GTAAATCTGCCCACACCCAG CTGGCCATATCCACCCCTCG
145 UC2D2 202705_at CCNB2 TTGTGCCCTTTTTCTTATTG GTTTAGAACTCTTGATTTTG
146 MMA11B11 202798_at SEC24B TTGAACTCTGGCAAGAGATG CCAAAAGGCATTGGTACCGT
147 IIG5H5 200053_at SPAG7 TGCTATTAGAGCCCATCCTG GAGCCCCACCTCTGAACCAC
148 HHG2H2 202945_at FPGS CACACCTGCCTGCGTTCTCC CCATGAACTTACATACTAGG
149 OOE9F9 201292_at TOP2A AATCTCCCAAAGAGAGAAAC CAATTTCTAAGAGGACTGGA
150 NC9D9 209760_at KIAA0922 GCCCCATCAACCCCACCACG GAACATTCGACCCACATGGA
151 XA4B4 204755_x_at HLF TCGTCAATCCATCAGCAATG CTTCTCTCATAGTGTCATAG
152 AAG6H6 209147_s_at PPAP2A ACGCCCCACACTGCAATTTG GTCTTGTTGCCGTATCCATT
153 QQE4F4 205190_at PLS1 TCCATCTTCCACTGTTAGTG CCAGTGAGCAATACTGTTGT
154 XC4D4 201391_at TRAP1 CGAGAACGCCATGATTGCTG CTGGACTTGTTGACGACCCT
155 UUG2H2 218807_at VAV3 TGGGCCTGGGGGTTTCCTAG CAGAGGATATTGGAGCCCCT
156 TTG9H9 209806_at HIST1H2BK GGGGTTGGGGTAATATTCTG TGGTCCTCAGCCCTGTACCT
157 PPG10H10 203755_at BUB1B GCTTGCAGCAGAAATGAATG GGGTTTTTGACACTACATTC
158 MA9B9 203465_at MRPL19 CCAGAATGGTCTTTAATGAG CATGGAACCTGAGCAAAGGG
159 VA9B9 202679_at NPC1 CCTTTTAGGAGTAAGCCATC CCACAAGTTCTATACCATAT
160 RRE8F8 218051_s_at NT5DC2 CTTCTCTGACCTCTACATGG CCTCCCTCAGCTGCCTGCTC
161 JJA4B4 204828_at RAD9A GCCTTGGACCCGAGTGTGTG GCTAGGGTTGCCCTGGCTGG
162 PPA12B12 203965_at USP20 ATCAGGATCAAAGCAGACGG GGCGTGGGTGGGGAAGGGGC
163 JJA9B9 209507_at RPA3 TGGAATTGTGGAAGTGGTTG GAAGAGTAACCGCCAAGGCC
164 XE1F1 203068_at KLHL21 CAGTTCACCCCAGAGGGTCG GGCAGGTTGACATATTTATT
165 NNNG3H3 201339_s_at SCP2 TCAGCTTCAGCCAGGCAACG CTAAGCTCTGAAGAACTCCC
166 PPG2H2 202369_s_at TRAM2 TGAAGGATGAACTAAGGCTG CTGGTGCCCTGAGCAACTGA
167 UUC11D11 208716_s_at TMCO1 AAGGCACTGTGTATGCCCTG CAAGTTGGCTGTCTATGAGC
168 CG4H4 218271_s_at PARL GGGATTGGACAGTAGTGGTG CATCTGGTCCTTGCCGCCTG
169 KKC6D6 202188_at NUP93 AGGTCCTCATGAATTAAGTG CCATGCTTTGTGGGAGTCTG
170 BBBA5B5 221245_s_at FZD5 GAGCCAAATGAGGCACATAC CGAGTCAGTAGTTGAAGTCC
171 RRE5F5 219485_s_at PSMD10 TGTGAGTCTTCAGCACCCTC CCATGTACCTTATATCCCTC
172 LA6B6 201263_at TARS CAGTGGCACTGTTAATATCC GCACAAGAGACAATAAGGTC
173 NNC5D5 213196_at ZNF629 AAACTGCTATGGACATGGAG GTCAGATGGGAACTTGGAAC
174 TC8D8 201932_at LRRC41 GCAAACAGGCATTCTCACAG CTGGGTTTATAGTCTTTGGG
175 SG8H8 204758_s_at TRIM44 GTCCTGACTCACTAAAGATG CCAGGATATTGGGGCTGAGG
176 IIIG8H8 213669_at FCHO1 CGCATGTCGCTGGTGAAGAG GAGGTTTGCCACAGGGATGT
177 NNA7B7 219581_at TSEN2 CACTTTCATACGCAGGCATC TCTTGTTACCTACATCTAAG
178 LE7F7 201704_at ENTPD6 TTCTGGACACCAACTGTGTC CTGTGAATGTATCGCTACTG
179 ZA7B7 205225_at ESR1 CCCTTTGCATTCACAGAGAG GTCATTGGTTATAGAGACTT
180 CCCG4H4 210582_s_at LIMK2 AAGCTCGATGGGTTCTGGAG GACAGTGTGGCTTGTCACAG
181 NNE11F11 202382_s_at GNPDA1 GTGCCTGTTTGAAGCTACTG CTGCCTCCATTTCTGGGAAA
182 PPE6F6 202809_s_at INTS3 TATGACGTGGTCAGGGTGTC CATTCCTAATCATGGGGCAG
183 SSG9H9 201833_at HDAC2 ACCAAATCAGAACAGCTCAG CAACCCCTGAATTTGACAGT
184 BBBE9F9 200697_at HK1 TCCGTGGAACCAGTCCTAGC CGCGTGTGACAGTCTTGCAT
185 NA7B7 208741_at SAP18 GGAATTGGTGTCCCTGTTAG CAATGGCAGAGACCAGCCTG
186 UC6D6 202117_at ARHGAP1 CTGGTCTGTACCCCAGGGAG CGGGTGCTTGTACTGTGTGA
187 TE9F9 202651_at LPGAT1 GCTGGTCACACGTGGATCTG GTTTATGAATGCATTTGGGA
188 LE3F3 203073_at COG2 TGGGCTTTCTAAAGAGGCTG CGGGAAGCCATCCTCCACTC
189 IIIC2D2 218108_at UBR7 GCAGCACAATAGTACCGATC AGTTAACTCAGCGCTGAAGG
190 HHHC9D9 201855_s_at ATMIN GCATGTAATAATACAAGAAC TGTTTCCCCCTCAAAACCTG
191 PPE5F5 202763_at CASP3 ACTGCACCAAGTCTCACTGG CTGTCAGTATGACATTTCAC
192 OOA3B3 206109_at FUT1 TGAGATAAAACGATCTAAAG GTAGGCAGACCCTGGACCCA
193 VE3F3 202891_at NIT1 GAACCTTGACTCTCTTGATG GAACACAGATGGGCTGCTTG
194 RRC12D12 204313_s_at CREB1 TGTCCTTGGTTCTTAAAAGC ATTCTGTACTAATACAGCTC
195 QA9B9 209029_at COPS7A TTTCCTCTCTCTGGCCCTTG GGTCCTGGGAATGCTGCTGC
196 PG7H7 209304_x_at GADD45B GGGAGCTGGGGCTGAAGTTG CTCTGTACCCATGAACTCCC
197 PPC4D4 202691_at SNRPD1 CTAGAATTGATTCTCCTTTC CTGAGTTTTACTCCACGGAG
198 RRA2B2 218375_at NUDT9 GCCATGCGTTGTAGCTGATG GTCTCCGTGTAAGCCAAAGG
199 PPC8D8 203080_s_at BAZ2B AACCACTGTGTTTTATCTAC TGTGTGTTGTGGTGGCCTGT
200 BBBC10D10 221750_at HMGCS1 GGGCAGGCCTGCAAATACTG GCACAGAGCATTAATCATAC
201 QQA11B11 213119_at SLC36A1 GACATAAATGGTGCTGGTAG GAGGTTATCAGAGTAAGGAA
202 EEEC12D12 202011_at TJP1 GGGGCAGTGGTGGTTTTCTG TTCTTTCTGGCTATGCATTT
203 QQC7D7 208190_s_at LSR TGGGCGGCTACTGGAGGAGG CTGTGAGGAAGAAGGGGTCG
204 UC10D10 202468_s_at CTNNAL1 ATGACAAGCTTATGCTTCTC CTGGAAATAAACAAGCTAAT
205 QA7B7 218206_x_at SCAND1 TCGGGCCCGGGGGCCTGAGC CTGGGACCCCACCCCGTGTT
206 EEEG10H10 204158_s_at TCIRG1 TGCTGGTCCCCATCTTTGCC GCCTTTGCCGTGATGACCGT
207 XG2H2 202128_at KIAA0317 TTAGCGTCTTTGAAGGAGAC CAGACATGAGTGAATACCTA
208 RG3H3 203105_s_at DNM1L TTATGAACTCCTGTGTATTG CAATGGTATGAATCTGCTCA
209 QQE5F5 205633_s_at ALAS1 TCCTATTTCTCAGGCTTGAG CAAGTTGGTATCTGCTCAGG
210 NG7H7 203228_at PAFAH1B3 TGGCTTTGTGCACTCAGATG GCACCATCAGCCATCATGAC
211 RC1D1 208820_at PTK2 ACCAGAGCACCTCCAAACTG CATTGAGGAGAAGTTCCAGA
212 ZZG8H8 204765_at ARHGEF5 GCTTAAACATTCTCCGCCTC CAGGGTGCAGATTCAGAGCT
213 IIE9E9 201719_s_at EPB41L2 TGGTTACAAGAAAGTTATAC CATTTAAAGCTGGCACCAGA
214 JJG10H10 212591_at RBM34 AGGATTGTGAGAGACAAAAT GACAGGCATCGGCAAAGGGT
215 OE11F11 202633_at TOPBP1 TCTTTTAACAGGAGCCTGAG CACAAGGTTTAATGAGGAAG
216 AAAG1H1 209213_at CBR1 TGACATGGCGGGACCCAAGG CCACCAAGAGCCCAGAAGAA
217 EEEE6F6 208879_x_at PRPF6 GCCTGCAACATTCGGCCGTG GTTACGATGAGTTTACCCCT
218 NE3F3 206398_s_at CD19 TGACTCTGAAATCTGAAGAC CTCGAGCAGATGATGCCAAC
219 TTA1B1 209095_at DLD CTTTTGTAGAAGTCACATTC CTGAACAGGATATTCTCACA
220 HHA9B9 201207_at TNFAIP1 AGTCTTTTTTGCCGAGAAAG CACAGTAGTCTGGGACTGGG
221 IIC9D9 201462_at SCRN1 CAGTCCCAGGTCCCAGCTCC CCTCTTATGGTTTCTGTCAT
222 FFFA3B3 218245_at TSKU GCAGTGAGCTCTGTCTTCCC CCACCTGCCTAGCCCATCAT
223 PC4D4 212910_at THAP11 TTTTCCTTCCCAGGTGCAGC CTGTGATTCTGATGGGGACT
224 IIG2H2 219968_at ZNF589 AGGAATGGCTGGTCCAGAGG CTTTTGTCCACTCCCTCTCA
225 MMMC11D11 221531_at WDR61 ATGCCTCCTGGGTGCTGAAC GTTGCATTCTGTCCTGATGA
226 NNNG7H7 205172_x_at CLTB GTCGGGGTGGAGACTCGCAG CAGCTGCTACCCACAGCCTA
227 WE7F7 202788_at MAPKAPK3 GGTATACTTGTGTGAAAGTG GCTGGTTGGGAGCAGAGCTA
228 ZG4H4 212054_x_at TBC1D9B GTGTTAGCCCCCACATGGGG CTGCTCTTGCTTCTACTAAA
229 SSG4H4 208510_s_at PPARG TGCTCCAGAAAATGACAGAC CTCAGACAGATTGTCACGGA
230 QG10H10 203574_at NFIL3 GAGACTTATAGCCACACAAC CAATCTCTGCTTCAGACTCT
231 YE1F1 201032_at BLCAP CGCTTCAGTAACAAGTGTTG GCAAACGAGACTTTCTCCTG
232 TE12F12 201889_at FAM3C ATATGCTAAATCACATTCAG CATGTGTATTTTGACATTTA
233 MMG11H11 202946_s_at BTBD3 GGCAGTCTTTGTCGTTGTTC ATTCTGGGGATAAAGGGGAA
234 UUG10H10 201380_at CRTAP TGCATCTCCAAAATTACAAC GGTTGGCCGATCCCATTTGA
235 FFFA8B8 219711_at ZNF586 CCTGCCAGTCATGAATCTCA GACAGCCTGCCACCTATTGC
236 QC8D8 203646_at FDX1 GAAGGCAGAGATCTAACCTG GCTTGTTTAGGGCCATACCA
237 HHHA6B6 204985_s_at TRAPPC6A AGGTGGGGGTGTCAGAGGAG GCAAAGGGGTCCCAGCTGCG
238 SA3B3 202680_at GTF2E2 TTTTTCTCCACTTCTAAATG GTTCCTGGTTCCTTTCTTCC
239 EEEA12B12 213135_at TIAM1 TATCATCTCCGGTTCGATCG CGTCCAGATGGAAAACGGAA
240 VG7H7 201761_at MTHFD2 AAGTACGCAACTTACTTTTC CACCAAAGAACTGTCAGCAG
241 TTG3H3 217825_s_at UBE2J1 CCTTGATTCAGTGCTCAGTG GTCTCCTAGTAAGAAGTCAC
242 OOC8D8 201158_at NMT1 GGTGCCATGTCTGGGAACAG GGACGGGGGAGCTTCACCTT
243 PPA7B7 202813_at TARBP1 TTCCTCAACAGGGCATTATC CGCTCCCTGAATGTCCATGT
244 JJJG4H4 206066_s_at RAD51C CACTGGAACTTCTTGAGCAG GAGCATACCCAGGGCTTCAT
245 PA5B5 217934_x_at STUB1 TGTTTCCCCTCTCAGCATCG CTTTTGCTGGGCCGTGATCG
246 MMA3B3 202394_s_at ABCF3 TATTCCCAAATGTCTCTATC CTTTTGACTGGAGCATCTTC
247 TA6B6 208647_at FDFT1 CATTCAGTGCCACGGTTTAG GTGAAGTCGCTGCATATGTG
248 LE1F1 202733_at P4HA2 TGTCTGGAGCAGAGGGAGAC CATACTAGGGCGACTCCTGT
249 JJJG6H6 201589_at SMC1A CAATCCATCTTCTGTAATTG CTGTATAGATTGTCATCATA
250 IIIC4D4 215000_s_at FEZ2 GGTGGTGATGGATTTTGTAG CTTGCTGCTTGTTTCACCAC
251 LC11D11 203963_at CA12 CACAGACAGTTTCTGACAGG CGCAACTCCTCCATTTTCCT
252 YC3D3 206662_at GLRX ATGGATCAGAGGCACAAGTG CAGAGGCTGTGGTCATGCGG
253 BBBG2H2 202942_at ETFB TGCTGGGCAAACAGGCCATC GATGATGACTGTAACCAGAC
254 XC6D6 201234_at ILK AGAAGATGCAGGACAAGTAG GACTGGAAGGTCCTTGCCTG
255 UUG9H9 212206_s_at H2AFV CCCTGTTTCCTGTTGATATG GTGATAGTTGGAGAGTCAAA
256 RRA1B1 217906_at KLHDC2 TGATCACCTTGCATGGACAG CAATCCTGTAAACATCACAG
257 OE12F12 201494_at PRCP ATCAGTGGCCCTCATAACTG GAGTAGAGTTCCTGGTTGCT
258 RA1B1 204054_at PTEN CTACCCCTTTGCACTTGTGG CAACAGATAAGTTTGCAGTT
259 RRC9D9 218856_at TNFRSF21 GGTCCAATCTGCTCTCAAGG CCTTGGTCCTGGTGGGATTC
260 LLLE7F7 211747_s_at LSM5 AGCTAAGTTTCCCGTTAAAG GGAAGTGCTTTGAAGATGTG
261 RRE12F12 206364_at KIF14 TTGCTGGCACAGTAGTTTAC CCTGTTATCTGTGTTTCATA
262 JJC4D4 204849_at TCFL5 TTGTCATGACTCTGAGTCAC GTGCTGCTGTATTGCAACGT
263 PPA1B1 202153_s_at NUP62 ACAATGAAGCCCAGTGTAAC GTCAGTCCACAGAAATAGCC
264 HHE5F5 218014_at NUP85 ACGTCTCGGATTGCCCCTCG GTCTTTCTGGATGACTCTGC
265 KKG10H10 205088_at MAMLD1 GCACCCTCGTGGGGTTAAGG CGAGCTGTTCCTGGTTTAAA
266 JJC6D6 205340_at ZBTB24 TGAAACACCTCGTTTTGAAG GTGAATCTTTGGTTTTCTCC
267 KKE3F3 203130_s_at KIF5C TCCATGTAACAAAAGATCTG GAAGTCACCCTCCTCTGGCC
268 YC5D5 208309_s_at MALT1 CTGTCATTGCAGCCGGACTC CAGATGCATTTATTTCAAGT
269 TTE4F4 221567_at NOL3 ACCCCACGCAAGTTCCTGAG CTGAACATGGAGCAAGGGGA
270 NE1F1 219650_at ERCC6L ATCTCAAAAAGCAACTTCTG CCCTGCAACGCCCCCCACTC
271 KKC10D10 201121_s_at PGRMC1 CTCTCCTAAGAGCCTTCATG CACACCCCTGAACCACGAGG
272 SSA1B1 203201_at PMM2 GTTCCCTCCAAACCTCCCAG CCACTCGGGCTTGTAACTGT
273 LLE4F4 218170_at ISOC1 GGATAGAAGGGTTTGCAATG CCATATTATTGGTGGAGGGC
274 IIIC5D5 203288_at KIAA0355 TGTGTGAAGCCGTTTGTGTG GTCTCCATGTAGGTGCTGTG
275 BBBA3B3 217838_s_at EVL TAAGGGGCCGGCCTCGCTGC GCTGATTCGTCGAGCCCATC
276 HHG4H4 213292_s_at SNX13 CTCAAATACTGTTGTGTCTG CACCAGTCTTTTAGTGTCTC
277 UC1D1 202602_s_at HTATSF1 GGGCCCCTATCCACTGGCAG CAGCTTTATTCTCAGTAGCG
278 ZC4D4 202349_at TOR1A CACCTTAGCAACAATGGGAG CTGTGGGAGTGATTTTGGCC
279 MMME10F10 201560_at CLIC4 CCAGAGTTGCATGTAGATAG CATTTATTTCTGTGCCCTTA
280 ZZA4B4 207749_s_at PPP2R3A TTTGCCTCAAACCTCTTACG GAGCTTCTCCTCAGAAGTGG
281 MMC12D12 203188_at B3GNT1 TGTGGCCTTGAGTAAATCCC GTTACCTCTCTGAGCCTCGG
282 LLC12D12 202187_s_at PPP2R5A CCTCACAACCTGTCCTTCAC CTAGTCCCTCCTGACCCAGG
283 IIG4H4 205607_s_at SCYL3 TAGGCAGTTCCTGACTGTTC CACATGTAGTACATTGTACC
284 LLE9F9 205130_at RAGE CATTTCTGTGATGTGTTGGG CGTGGTTGGAAGGTGGGTTC
285 IIIE11F11 218854_at DSE CTGGTCTCTGCACACATATG CTTGGTTACTTGCATGCATT
286 OOA2B2 203857_s_at PDIA5 TGTTCTACGCCCCTTGGTGC CCACACTGTAAGAAGGTCAT
287 QQE7F7 208445_s_at BAZ1B ACTGCGGAATGTGGCCTCTG CTTCCTCCGTCCTCCTGCCC
288 NNNE4F4 203360_s_at MYCBP AAAATCCAGAAATAGAGCTG CTTCGCCTAGAACTGGCCGA
289 JJC7D7 205909_at POLE2 AGGACATCTGACTCCCCTAC CTCTTTATGTCTGCCCAGTG
290 YYG6H6 210563_x_at CFLAR CTTGAAGATGGACAGAAAAG CTGTGGAGACCCACCTGCTC
291 UC4D4 200071_at SMNDC1 GGATGTGTGATGTTTATATG GGAGAACAAAAAGCTGATGT
292 PA9B9 209259_s_at SMC3 TTGGAAAATACTACCTACTG GTTTGGGAGATGTATATAGT
293 OOC2D2 203931_s_at MRPL12 TCCAAGGCATCAACCTCGTC CAGGCAAAGAAGCTGGTGGA
294 KKE1F1 200678_x_at GRN CCTGTCAGAAGGGGGTTGTG GCAAAAGCCACATTACAAGC
295 JJJE9F9 202735_at EBP CCTGCCAGAAGAGTCTAGTC CTGCTCCCACAGTTTGGAGG
296 BC8D8 201804_x_at TBCB TTGGTGTCCGCTATGATGAG CCACTGGGGAAAAATGATGG
297 LLE2F2 219573_at LRRC16A CGGAGTACTGCTAAGTGTAC CTGTGTCAAATCCGCACAGG
298 XC8D8 201614_s_at RUVBL1 GCTGCCGTCCCCACTCAGGC GTGGTCTGCAGCGCTGTCAG
299 EEEE10F10 336_at TBXA2R CCCTGAATTTGACCTACTTG CTGGGGTACAGTTGCTTCCT
300 AAG2H2 202052_s_at RAI14 TTCAGAAAATACACAACAGC CCCTTCTGCCCCCGCACAGA
301 RC12D12 212899_at CDC2L6 TTTCCTGCTTTTGAGTTGAC CTGACTTCCTTCTTGAAATG
302 TE3F3 202433_at SLC35B1 TGGCCTCTGTGATCCTCTTC GCCAATCCCATCAGCCCCAT
303 AAG10H10 201591_s_at NISCH TCTGACTTTCTCTTCTACAC GTCCTTTCCTGAAGTGTCGA
304 OG4H4 202518_at BCL7B TGAGGTTCTGACAACAGTAC CCATCCCCCACAGTACCCCT
305 RRG4H4 219184_x_at TIMM22 GCTGAGGGGCTGTTCACCAC CATCCTCGTTCTCCAGGGTC
306 WE1F1 203334_at DHX8 GAAAGGGACAATTTGTGCAG CTCCAGGATGGGAAGGTGGA
307 LLLC9D9 204517_at PPIC GTCACCCTTTAGTTTGCTTG AACTTTAGTAAACCACCTGC
308 WA2B2 202396_at TCERG1 GCATTTGTGGCTTGAACTTG CCAGATGCAAATACCACAGA
309 NE2F2 218034_at FIS1 TTTCTGCTCCCCTGAGATTC GTCCTTCAGCCCCATCATGT
310 VC7D7 209189_at FOS CCCAGTGACACTTCAGAGAG CTGGTAGTTAGTAGCATGTT
311 HHG3H3 212462_at MYST4 TGTACAGGGTGACAGTAAGG GCCAAGCAGGAGAGGCGTAA
312 AAG12H12 202329_at CSK GGGCATTTTACAAGAAGTAC GAATCTTATTTTTCCTGTCC
313 JJJG12H12 206571_s_at MAP4K4 GGAGCTGCACCGAGGGCAAC CAGGACAGCTGTGTGTGCAG
314 VG6H6 202778_s_at ZMYM2 ACTGGGTTCTTAACCAGATG GTTGTGTATGGGTAGCACTA
315 OC9D9 205376_at INPP4B TCAACATGCTACAGCTGATG GCTTTCCCCAAGTACTACAG
316 FFFG8H8 218916_at ZNF768 GAAGTGACATGCCCTGGAGA CTTGTGGGAAGTGGGTTGGA
317 IIA8B8 219499_at SEC61A2 CACCGAGCTAAGTCTGTGTG CAGCATTAGTACCCGCTGCC
318 JJA12B12 218898_at FAM57A CCCATTCCTGTGTGTCCGTC CTGCCATTTAGCCACAGAAG
319 BBBG1H1 220161_s_at EPB41L4B CCCTAGTCTGTTGGTAGAAC CAGAAATCAATATGTTGTCT
320 RRA6B6 200981_x_at GNAS GCATGCACCTTCGTCAGTAC GAGCTGCTCTAAGAAGGGAA
321 QQC8D8 209191_at TUBB6 TCGGCCCCTCACAAATGCAG CCAAGTCATGTAATTAGTCA
322 RC7D7 202776_at DNTTIP2 GGAAGTACTCAGAGATCATG GCTGAAAAAGCAGCAAATGC
323 NNNA6B6 203582_s_at RAB4A ACAGATGCCCGAATGCTAGC GAGCCAGAACATTGTGATCA
324 QQC3D3 204977_at DDX10 AGATCGAGGGTGGATGATAC CATTTCCTGACCCCGTTTTC
325 QQA10B10 201412_at LRP10 GCACCGGAATGCCAATTAAC TAGAGACCCTCCAGCCCCCA
326 RC3D3 203367_at DUSP14 CACTTTGGGGCCTCATTAAC CCTTTAGAGACAAGCTTTGC
327 MMMG8H8 201379_s_at TPD52L2 GGGTTAAAATCGGCCTGTGG GGTGTGGTGAGAAGGCAGGT
328 AAAG3H3 203973_s_at CEBPD TGCCCGCTGCAGTTTCTTGG GACATAGGAGCGCAAAGAAG
329 EEEE12F12 212770_at TLE3 GTCTCTTGTGGCCCAAACAG GTTAGGTAGACTATCGCCTC
330 AAE9F9 203192_at ABCB6 AACCTCTGAAGACACTAAGC CTCAGACCATGGAACGGTGA
331 SSE10F10 202180_s_at MVP CTGAAATCAACCCTCATCAC CGATGGCTCCACTCCCATCA
332 PPC6D6 202801_at PRKACA TTCAAGGCTAGAGCTGCTGG GGAGGGGCTGCCTGTTTTAC
333 JJE9F9 209691_s_at DOK4 GTGGCAGGAGGATGATAAAG CACGCGGCCCCTCCCAAAGG
334 LLLA2B2 201185_at HTRA1 ATGCGTAGATAGAAGAAGCC CCACGGGAGCCAGGATGGGA
335 OA9B9 207700_s_at NCOA3 ATAGTATACTCTCCTGTTTG GAGACAGAGGAAGAACCAGG
336 UUG1H1 219460_s_at TMEM127 TACACCCAGCCCCGAGTGTG CATCACGGTAAAAGAGCTGA
337 YG10H10 205548_s_at BTG3 CATTGTGACCGGAATCACTG GATTAATCCTCACATGTTAG
338 RG10H10 218039_at NUSAP1 AGCTGGGATAGAAAGGCCAC CTCTTCACTCTCTATAGAAT
339 LLG4H4 218290_at PLEKHJ1 CATCCAAAGCCTGAAGCCAG GTGGGTGTGGGCAGGGGCTG
340 PPA2B2 202328_s_at PKD1 GGGCAAGTAGCAGGACTAGG CATGTCAGAGGACCCCAGGG
341 XG5H5 201976_s_at MYO10 GGGGGAGAGACGCTGCATTC CAGAAACGTCTTAACACTTG
342 LLG7H7 212726_at PHF2 CTGGATGTTTTTGTCCACTG GGAGAGGCAGCTTGGTGGAG
343 YC4D4 201000_at AARS GAACACACTTGGGAGCAGTC CTATGTCTCAGTGCCCCTTA
344 PA8B8 210640_s_at GPER CCCTCTGTGGAGCGCCCGCC GTCTGCTCCGGGGTGGTTCA
345 SSC9D9 201727_s_at ELAVL1 CACTCCTCTCGCAGCTGTAC CACTCGCCAGCGCGACGGTT
346 MMA7B7 207290_at PLXNA2 GCCTGGCCACCCACACTCTG CATGCCCTCACCCCACTTCT
347 HHA12B12 210074_at CTSL2 GATGGATGGTGAGGAGGAAG GACTTAAGGACAGCATGTCT
348 LLLG4H4 202087_s_at CTSL1 TTCATCTTCAGTCTACCAGC CCCCGCTGTGTCGGATACAC
349 OOG3H3 209435_s_at ARHGEF2 GGGGATTTTTCAGTGGAACC CTTGCCCCCAAATGTCGACC
350 JJJC5D5 203126_at IMPA2 ACCCCAGAGGGAGTTGTCAC GCTACAGTGAGTGGCTGGCC
351 YE10F10 217722_s_at NGRN AATAGGAAGAGGTGTTGAGC CTGGACTGTGGGAGGAAAGA
352 ZZC9D9 202207_at ARL4C GTGGTCACCAGGGGGACAGG GAGCCCCCCACCAATGTATC
353 QG7H7 206688_s_at CPSF4 ATTTTCTCTTGGGGTACGTG CCTGACAGTGTTTAAGGTGT
354 NNNC6D6 218193_s_at GOLT1B TGAAATCCATGTTAATGATG CTTAAGAAACTCTTGAAGGC
355 SSC11D11 202675_at SDHB AAGGCAAGCAGCAGTATCTG CAGTCCATAGAAGAGCGTGA
356 XE2F2 203266_s_at MAP2K4 TGCTGTCAACTTCCCATCTG GCTCAGCATAGGGTCACTTT
357 PA7B7 201967_at RBM6 GTTGGAGCCTCAGGAAGAAC CAGCAAAAGACAGTCCAACG
358 IIIG5H5 212851_at DCUN1D4 AGTGGACAAGAAACCACCAG CATTGAGCTAACCCAGTACA
359 UA12B12 203640_at MBNL2 GGAACTACATTTCACTCTTG GTTTTCAGGATATAACAGCA
360 UA6B6 201960_s_at MYCBP2 TCAAACTTGTGAGGTGTTTG CATGTGGCCATTACCGTCAT
361 UUC4D4 200636_s_at PTPRF GTCCTTATTATCCCAGCTTG CTGAGGGGCAGGGAGAGCGC
362 NNG11H11 202427_s_at BRP44 CTTTGTGGGGGCAGCAGGAG CCTCTCAGCTTTTTCGTATT
363 AAAA12B12 200789_at ECH1 TGGCCGAGAGCCTCAACTAC GTGGCGTCCTGGAACATGAG
364 AAAE5F5 218597_s_at CISD1 ACCACCTCTGTCTGATTCAC CTTCGCTGGATTCTAAATGT
365 RRA11B11 202550_s_at VAPB AACTCTGTTGGGTGAACTGG TATTGCTGCTGGAGGGCTGT
366 MMMA11B11 209337_at PSIP1 GGTCATTTGGCACTTCTCAG CAAGTAGGATACTTCTCATG
367 OOE3E3 208626_s_at VAT1 AGGACCTGGGCCATTGCAAC CAAAATGGGGACTTCCTGGG
368 NNNE1F1 222125_s_at P4HTM CCCCGCCAGCCGCGATACGG CGCAGTTCCTATATTCATGT
369 KKG9H9 200078_s_at ATP6V0B TCCAGAGTGAAGATGGGTGA CTAGATGATATGTGTGGGTG
370 YA2B2 200752_s_at CAPN1 CTTCAGGGACTTGTGTACTG GTTATGGGGGTGCCAGAGGC
371 WE9F9 217874_at SUCLG1 TCAGTATGTCTCCTGCACAG CTGGGAACCACGATCTACAA
372 HHC4D4 212723_at JMJD6 ACCCATTCACTTAGCGTTTG CTCCAGTAGCTTTCCCTCTG
373 ZZC5D5 212811_x_at SLC1A4 GAAGGGGAAGATCTGAGAGC GTGCTGTTTGTGGCTGTTGA
374 MME4F4 212140_at PDS5A GGCCCACCCCAATTTTGTAA CATGATGCAAGTGTCTGGCA
375 AAE11F11 219222_at RBKS GCTTACTATCCAAATCTGTC CTTGGAAGACATGCTCAACA
376 TTE12F12 217950_at NOSIP CTGGGGCTGTGGTCACCCTC GAATGCGTGGAGAAGCTGAT
377 OOC10D10 201432_at CAT TTAATACAGCAGTGTCATCA GAAGATAACTTGAGCACCGT
378 NNNC1D1 218845_at DUSP22 TTATCCCCACTGCTGTGGAG GTTTCTGTACCTCGCTTGGA
379 YG2H2 201314_at STK25 GCCTTGTGGTGTTGGATCAG GTACTGTGTCTGCTCATAAG
380 MMG9H9 202414_at ERCC5 AAACCAGTGCTTCAGATTCG CAGAACTCAGTGAAGGAAGC
381 PE5F5 203659_s_at TRIM13 TTCTTTGCCTCAAGACACTG GCACATTCATTAGCAAGATT
382 FFFA2B2 210241_s_at TP53TG1 CATGATGCTGGGGAGCTTGG CGCCTGACCCAGGATCTAGA
383 RRE7F7 204761_at USP6NL TAGTAGAAAACCCGACATTG ATGTTTCTTCCTGTTGCAAG
384 XA9B9 208946_s_at BECN1 ATCTATAGTTGCCAGCCCTG GTCAGTTTTGATTCTTAACC
385 CCCE2F2 204017_at KDELR3 CCTTCAGGCCAGAAGCAAAC CAAATTTACCAGGTTTGGCT
386 BBBA1B1 204256_at ELOVL6 GATGGCAAGGGCTTTTTCAG CATCTCGTTTATGTGTGGAA
387 RRC11D11 221848_at ZGPAT ACTGCTGAGTGGAGACAGAG CTGCGGGGTCCCATCTGGAC
388 JJG5H5 205161_s_at PEX11A TGATGTGGGCAGAGATGAGG CCAAGAACGGAGAAGGGAGG
389 VC2D2 202894_at EPHB4 GGTGGAACCCAGAAACGGAC GCCGGTGCTTGGAGGGGTTC
390 YG9H9 209710_at GATA2 CGCTGCAGGGAGCACCACGG CCAGAAGTAACTTATTTTGT
391 TTC9D9 215980_s_at IGHMBP2 AGAGCCTCCCGGCCTTCTCC GGTGTCCTGTACCAACTCTT
392 RE9F9 203221_at TLE1 TTGCCCAAGTGTGAGATTAC CTTTCTGTTCCTTGCAGTTC
393 IIC6D6 202950_at CRYZ AGTTTCCAAGGGTTTTCAAG CCTACTTACCTTTATAAAGG
394 OG10H10 40562_at GNA11 CTCTCCCTCCGTACACTTCG CGCACCTTCTCACCTTTTGT
395 RE11F11 203302_at DCK TCAAAGATGATAATTTAGTG GATTAACCAGTCCAGACGCA
396 NNG12H12 202545_at PRKCD TTCTTCAAGACCATAAACTG GACTCTGCTGGAAAAGCGGA
397 PPE11F11 203884_s_at RAB11FIP2 GGGCCTGTTAGTCTTCGAAG CTTCCAGATGGTTTGTGTTT
398 QQE10F10 212973_at RPIA GGGGTTTCTTCATATTCCTG CTGTTGGAAGCAGTTGACCA
399 HHE6F6 202452_at ZER1 GGCAGGACGGCAGGGGTGAG CAGCTTTGGGAGAGACACCT
400 LG6H6 221046_s_at GTPBP8 TGACCTTTTCTGGAATCCAC CTGTTGAGATGCTTTATAGC
401 OA8B8 201366_at ANXA7 AGCTCTGCCTTCCGGAATCC CTCTAAGTCTGCTTGATAGA
402 WG12H12 202954_at UBE2C CCCAGGCTGCCCAGCCTGTC CTTGTGTCGTCTTTTTAATT
403 SSA10B10 201984_s_at EGFR ATCTGTGTGTGCCCTGTAAC CTGACTGGTTAACAGCAGTC
404 XA2B2 201161_s_at CSDA GGGACAGACCTTTGACCGTC GCTCACGGGTCTTACCCCAT
405 LLA9B9 206173_x_at GABPB1 CTGTGGATGGTGCCATTCAG CAAGTAGTTAGTTCAGGGGG
406 LA2B2 207038_at SLC16A6 GACACAAGGAGGCAGAGGAG CTAACCCCTCTACTCCACTT
407 AAE10F10 202179_at BLMH AGACCTAATGCTCCTTGTTC CTAGAGTAGAGTGGAGGGAG
408 IIIA1B1 209567_at RRS1 TGCCTTCATTGAGTTTAAAG GGACAGGATTGCCCTTCCGT
409 NNNE10F10 209109_s_at TSPAN6 CGCCTACTGCCTCTCTCGTG CCATAACAAATAACCAGTAT
410 TTA12B12 209260_at SFN GCATGTCTGCTGGGTGTGAC CATGTTTCCTCTCAATAAAG
411 SSG3H3 201729_s_at KIAA0100 ATGATTTGGCGATTCGAGTG GCTGCAGTACAGGATCTGAC
412 HHE10F10 209166_s_at MAN2B1 GCGCCCCCGTTACCTTGAAC TTGAGGGACCTGTTCTCCAC
413 LC6D6 201794_s_at SMG7 GACAAGCTAACCAGGTTTAC CATCTCACTCCCAGTAATAC
414 LLA4B4 208936_x_at LGALS8 AATCACCAATCAAGGCCTCC GTTCTTCTAAAGATTAGTCC
415 QQA2B2 204788_s_at PPOX CAATTCCTGACTGCTCACAG GTTGCCCCTGACTCTGGCTG
416 OOE2F2 204106_at TESK1 GTCTCAGGCCTCCAACTTTG GCCTTCAGGACACCCTGTAA
417 MG11H11 201849_at BNIP3 CAGTTTTCTGCTGAAGGCAC CTACTCAGTATCTTTTCCTC
418 TE7F7 203685_at BCL2 TTTCATTAAGTTTTTCCCTC CAAGGTAGAATTTGCAAGAG
419 HHHE11F11 205205_at RELB GATGTCTAGCACCCCCATCC CCTTGGCCCTTCCTCATGCT
420 XA10B10 203575_at CSNK2A2 GGGTATGCAGAATGTTGTTG GTTACTGTTGCTCCCCGAGC
421 MMG2H2 202022_at ALDOC GCCAGGGCCAAATAGCTATG CAGAGCAGAGATGCCTTCAC
422 OOC12D12 201817_at UBE3C GGGGGGAGGGGATCTAAATC CTCATTTATCTCTTCTATGT
423 NNC9D9 201236_s_at BTG2 GTGTTCTTGCATCTTGTCTG CAAACAGGTCCCTGCCTTTT
424 RG7H7 210022_at PCGF1 CTGATCACATGACAATGAAG CAGATATGGCTCTCCCGCTG
425 YYC12D12 201565_s_at ID2 CTGTGGACGACCCGATGAGC CTGCTATACAACATGAACGA
426 NE12F12 201186_at LRPAP1 AGGACCTCGATGTCCAGCTG CTGTCAGGTCTGATAGTCCT
427 SC7D7 204324_s_at GOLIM4 AAGGCCGAGAGGAACACTAC GAGGAGGAAGAAGAGGAGGA
428 KKA3B3 213370_s_at SFMBT1 GTATCAGCTTGCTCTCTTTG CACTTTCGGGGAAGGAGGAC
429 VG1H1 201270_x_at NUDCD3 AGAGTGAGGTGTCCAGCCTG CAAAGCTATTCCAGCTCCTT
430 NC10D10 204217_s_at RTN2 CTAATTACCTGAGCGACCAG GACTACATTTCCCAAGAGGC
431 RRC8D8 201707_at PEX19 AGATCATCTTTGAGTAGCAC TGTTTTGGGGCCCTCGGTCT
432 OOE12F12 201963_at ACSL1 GAGAGTACATGTATTATATA CAAGCACAACAGGGCTTGCA
433 UA8B8 203038_at PTPRK TTTTTCAGCCTGTGGCCCAG CACTGGTCAAGAAAACAAGA
434 RA5B5 205202_at PCMT1 GATGTCCTGTAAACACTCAG CTGTTCAGATTGGACATAAC
435 MME2F2 201924_at AFF1 GCTCTCAATGGGAAGATGTG CAACACAAATTAAGGGGAAC
436 HHA5B5 213772_s_at GGA2 CTTGTTGCACTGTTCCCAGG CGAGTGGCTGCCATGAGACC
437 YYC6D6 203773_x_at BLVRA ACTGGCTGCTGAAAAGAAAC GCATCCTGCACTGCCTGGGG
438 PPA6B6 202797_at SACM1L CAAAGACCAAATCTGAACTG CTAATGTGGCTGCTTTGTAG
439 PPE3F3 202431_s_at MYC CCACAGCATACATCCTGTCC GTCCAAGCAGAGGAGCAAAA
440 MMMG6H6 209367_at STXBP2 GCTCATCGTGTATGTCATGG GCGGTGTGGCCATGTCAGAG
441 RRE11F11 201361_at TMEM109 GAGGTGGATGTCCTTCTCTG CCAGGCTTGGCACATGATGT
442 MMME12F12 210788_s_at DHRS7 TACATGCCAACCTGGGCCTG GTGGATAACCAACAAGATGG
443 AAG8H8 203119_at CCDC86 CTTTCCCAAACCAGTCTCTG CAGAAGCCCCAGAGAATCTA
444 SSC8D8 1007_s_at DDR1 GCTTCTTCCTCCTCCATCAC CTGAAACACTGGACCTGGGG
445 OG7H7 203304_at BAMBI GGCACGGGAAGCTGGAATTC GTATGACGGAGTCTTATCTG
446 DDC2D2 201007_at HADHB TTTCAATAATCAGTTTACTG CTCTTTCAGGGATTTCTAAG
447 RRC7D7 201710_at MYBL2 CCCATTCTCATGTTTACAGG GGTTGTGGGGGCAGAGGGGG
448 NNE2F2 204729_s_at STX1A CATGTTTGGGATGGTGGCTC CTGTTGTCTTGCGCTCTGGG
449 IIE8F8 217398_x_at GAPDH CTGCCACCCAGAAGACTGTG GATGGCCCCTCCGGGAAACT
450 LA1B1 209899_s_at PUF60 TAGCCTCTGAGACTCATAAG GCCATCCAGGCCCTCAATGG
451 HHHC7D7 212660_at PHF15 GCAATAGAATGTATGGTCAC CTGGGTGTGGCCAGTGCCCG
452 CCCC5D5 206723_s_at LPAR2 GCAGCAGAGACTGAGGGGTG CAGAGTGTGAGCTGGGAAAG
453 TG2H2 202423_at MYST3 ATCCCCTGTGAATCAGAGTG CACAAGCACCTCTCCTGTGA
454 RE6F6 203570_at LOXL1 ACCAACAACGTGGTGAGATG CAACATTCACTACACAGGTC
455 UUE4F4 202738_s_at PHKB ACATCCTTGGCGGGGTTATG GACCTCTTGCATGTCATAGC
456 UUA4B4 221610_s_at STAP2 TTGGCCAGTCATCCTGAAGC CAAAGAAGTTGCCAAAGCCT
457 SSC4D4 204549_at IKBKE TCACCACTGCCAGCCTCAGG CAACATAGAGAGCCTCCTGT
458 VE7F7 203596_s_at IFIT5 GACTTAATTGGCATGGGGTG CAGTCCAGGCATCATGATTT
459 UUE11F11 218255_s_at FBRS ACCTCTTAATGGCTCAGTCC CCTTCACCCCATTTCCAAGT
460 PC2D2 201528_at RPA1 TCCCCTAAGGAAATCCGAGC GGCTACAAAGCGTTTCTTTA
461 IIG9H9 201738_at EIF1B CTGCCTTGTGAAATGATTCC CTGCAGTAAACGGACTTTTC
462 TG3H3 201146_at NFE2L2 CCTGCAGCAAACAAGAGATG GCAATGTTTTCCTTGTTCCC
463 RRG6H6 221081_s_at DENND2D ATTGATTTCTCAGGACTTTG GAGGGCTCTGACACCATGCT
464 TTC7D7 218529_at CD320 GCCCTGTGCTTAAGACACTC CTGCTGCCCCGTCTGAGGGT
465 KKKC10D10 218086_at NPDC1 CCTCGGATGAGGAGAATGAG GACGGAGACTTCACGGTGTA
466 HHHG8H8 219051_x_at METRN GACGCTGAGCTGCTCCTGGC CGCATGCACCAGCGACTTCG
467 JJJA7B7 201014_s_at PAICS AACATCTGCGCATAAAGGAC CAGATGAAACTCTGAGGATT
468 MMC7D7 200757_s_at CALU AGAGCCTCACACCTCACTAG GTGCAGAGAGCCCAGGCCTT
469 CCCE4F4 201212_at LGMN TCCAGGACCTTCTTCACAAG ATGACTTGCTCGCTGTTACC
470 XC3D3 212850_s_at LRP4 CTGGCGAGCCCTTAGCCTTG CTGTAGAGACTTCCGTCACC
471 WE12F12 201243_s_at ATP1B1 AAAGCTGTGTCTGAGATCTG GATCTGCCCATCACTTTGGC
472 PPE4F4 202696_at OXSR1 CCCCTTGTCCCTGGAGTAGG GACTAACTATAGCACAAAGT
473 IIA12B12 222217_s_at SLC27A3 GGCCGTTGCAGGTGTACTGG GCTGTCAGGGATCTTTTCTA
474 NNA5B5 212795_at KIAA1033 CTGGAAACGAATTTAAATGG TGTCAAACTGCAGAGCAACA
475 MMMA1B1 212815_at ASCC3 CTGCCGCATAAACTATAAAT CTGTAAGGTGGTACACAGCG
476 JJC1D1 203512_at TRAPPC3 AAGCCACCCAGGTCTCATTC CTCCCTGCTGTTGGAGGCAA
477 TTC10D10 218948_at QRSL1 ATGCGCATGGCAAGAACTTG CCTTACCCCAGATTCTCTAT
478 XE10F10 209224_s_at NDUFA2 CCCTTTGAACAACTTCAGTG CTGATCAGGTAACCAGAGCC
479 JJA7B7 205811_at POLG2 TAGGAAGAGGCCCCACATTG GAACTAAGACAGGTTTGTCA
480 JJJE11F11 204608_at ASL CTCAAGGGACTTCCCAGCAC CTACAACAAAGACTTACAGG
481 LE6F6 209161_at PRPF4 TACAGTGAAGAAGACTTCAC CTCTTCCTATTGAGTTTGCT
482 JJJC12D12 205120_s_at SGCB CTCTTCAAGGTGCAAGTAAC CAGCCAGAACATGGGCTGCC
483 ZZC2D2 208634_s_at MACF1 ACCAGTAACTCTTGTGTTCA CCAGGACCCAGACCCTTGGC
484 YG4H4 202160_at CREBBP TTCTTGAATTCATGTACATG GTATTAACACTTAGTGTTCG
485 AAE7F7 201807_at VPS26A CAAAAGGGTCCATGTACCAC CATGTGCTGGAGCATCTGTT
486 ME4F4 205406_s_at SPA17 GCCTTCCGGGGACACATAGC CAGAGAGGAGGCAAAGAAAA
487 AAC2D2 214404_x_at SPDEF CCCCTGAGTTGGGCAGCCAG GAGTGCCCCCGGGAATGGAT
488 HHA6B6 57703_at SENP5 ATGCCCCGAGTGCGGAAGAG GATTTACAAGGAGCTATGTG
489 YA3B3 213720_s_at SMARCA4 GATGCATGTGCGTCACCGTC CACTCCTCCTACTGTATTTT
490 QQA4B4 212047_s_at RNF167 AGCTTCTCCCTTACCCACAC CTATCCTTTTGAGGGGCTTT
491 LLLG11H11 202083_s_at SEC14L1 CACCCAGCGGCGACATTGTA CAGACTCCTCTCACCTCTAG
492 PPG11H11 203919_at TCEA2 CCGTTGACACAGCTTCTCTG GAGACCCTAGAAGGCGGCAT
493 QC6D6 200666_s_at DNAJB1 CTCTGTATAGGGCCATAATG GAATTCTGAAGAAATCTTGG
494 AAG5H5 203409_at DDB2 GTTAAAGGGCCAAAAGTATC CAAGGTTAGGGTTGGAGCAG
495 PPA4B4 202623_at EAPP GGAAGATGCTGCCGAGAAGG CAGAGACAGATGTGGAAGAA
496 LLE10F10 212955_s_at POLR2I CACGAAGTGGACGAACTGAC CCAGATTATCGCCGACGTGT
497 PPE1F1 202241_at TRIB1 CTAGAAACACTAGGTTCTTC CTGTACATACGTGTATATAT
498 QG6H6 203054_s_at TCTA CCCACCCACTAATACTACTG CACAGAGTCAGGATCTCACA
499 HHHA10B10 204514_at DPH2 GTTCAGACAGCCACATGAGG GGACAGTGCAGCTACAGGAT
500 KKKC3D3 208872_s_at REEP5 AATTAAAGCTATAGAGAGTC CCAACAAAGAAGATGATACC
501 NNG8H8 201125_s_at ITGB5 TGAGTCCTGAGACTTTTCCG CGTGATGGCTATGCCTTGCA
502 JJJE7F7 201127_s_at ACLY GGGGTACAGGCACCGAAGAC CAACATCCACAGGCTAACAC
503 OG9H9 201558_at RAE1 GGGTTGAGGTTATTGTAGAC GTTAGATTGCGGGCACCGCC
504 KKE8F8 201664_at SMC4 GGTTTACCAGGATGTAGTCC CACTGTTGAGGAGCATCTAT
505 SA1B1 203026_at ZBTB5 TGCCTCTCCACTGCTAGATG GAACCTGGAATCTCTCATCT
506 KKA6B6 202025_x_at ACAA1 AATGAGCTGAAGCGCCGTGG GAAGAGGGCATACGGAGTGG
507 MMG3H3 204978_at SFRS16 CAAGATCCGCATGAAGGAGC GGGAACGCCGAGAGAAGGAG
508 AAG1H1 202732_at PKIG ACCTCTGCCCTGTCCACCAG GATAAGTGACACCTAGGACC
509 LLA12B12 205667_at WRN AAATCAGCCTTCCGCAATTC ATGTAGTTTCTGGGTCTTCT
510 NG1H1 202038_at UBE4A CATGCCAGAGGCTGATGCTG CACTGTTGATGTCATGTGAG
511 HHA4B4 89476_r_at NPEPL1 AGGACCCTCTGCTGAACCTG GTGTCCCCACTGGGCTGTGA
512 KKKG3H3 208950_s_at ALDH7A1 CCTAAAGGATCAGACTGTGG CATTGTAAATGTCAACATTC
513 RRG3H3 218788_s_at SMYD3 ATGCGACGCCAACATCAGAG CATCCTAAGGGAACGCAGTC
514 JJE8F8 209045_at XPNPEP1 AGATGCCCCGACTTCTTTGG CCAGTGATGGGGAATCAGTG
515 LLG2H2 219459_at POLR3B CCTGGCTTTTGTCGTGGTGG CTGGCTCGGATAAATTTTCC
516 QQG9H9 206050_s_at RNH1 CTGGCTCTGTGCTGCGGGTG CTCTGGTTGGCCGACTGCGA
517 LLG10H10 218064_s_at AKAP8L GCAAGAAGCTGGAGCGCTAC CTGAAGGGCGAGAACCCTTT
518 HHHE7F7 202185_at PLOD3 TGAATATGTCACCTTGCTCC CAAGACACGGCCCTCTCAGG
519 WWE4F4 201145_at HAX1 CTCAGGGGCTTGGATATGTG GAATAGTGAACTGGGGCCAT
520 PPG6H6 202812_at GAA AATAAGATTGTAAGGTTTGC CCTCCTCACCTGTTGCCGGC
521 VC9D9 202125_s_at TRAK2 ATGCATGCAGACCTGTACTC CACATGCAACCCAACAGCAG
522 WA3B3 202927_at PIN1 CCGAATTGTTTCTAGTTAGG CCACGCTCCTCTGTTCAGTC
523 MMG12H12 203306_s_at SLC35A1 ACTCGGACAATTTCTGGGTG GTGACTGAGTACCCCTTTAG
524 PG11H11 203727_at SKIV2L ACATCGTATTTGCGGCCAGC CTCTACACCCAGTGAATGCC
525 KKC11D11 202829_s_at VAMP7 ATGGTACCTGTTCTTCTATC CAAACCTTTCAATTCATGCT
526 KKC8D8 201513_at TSN ACTTAAGTGGCTAAAGAGAT GAGACAAACATGCAGGTCGC
527 EEA10B10 220964_s_at RAB1B CCCCTCTGGTGTCATGTCAG GCATTTTGCAAGGAAAAGCC
528 LLE5F5 203897_at LYRM1 GGTAGAGTCAGGTGAGAGTC CCTTGGTGAGTCATTTGTAC
529 AAA9B9 203573_s_at RABGGTA GCCCTGCCCCCTACCCTTGC CCTTTAACTTATTGGGACTG
530 TTE1F1 204089_x_at MAP3K4 CATTACTACTGTACACGGAC CATCGCCTCTGTCTCCTCCG
531 MMMG2H2 219076_s_at PXMP2 TCCGGGTGCTCTTCGCCAAC CTGGCAGCTCTGTTCTGGTA
532 MMC5D5 212648_at DHX29 ACGTCTTCTTTCTATTGATG GCTGGATCTATTTTCAGGCC
533 ZZA9B9 212614_at ARID5B GTTGGCTGTTAGTGTATTTG ATATTCTGCCTGTCTCCTCA
534 FFFC2D2 210986_s_at TPM1 CAGCTCATGACAATCTGTAG GATAACAATCAGTGTGGATT
535 OOG1H1 203616_at POLB GGAAATACCGGGAACCCAAG GACCGGAGCGAATGAGGCCT
536 AAA11B11 202491_s_at IKBKAP TTCCACTCATTCCTGTTGTC CTACCACCCCTTGCTCTTTG
537 QQC9D9 212500_at ADO GTGTGCATAAACTGTTAGTC GTGACTGACTTGGTGTGTTG
538 EEEC11D11 202720_at TES TACTTCCAAGCCTGTCCATG GATATATCAAATGTCTTCAC
539 HHG10H10 214259_s_at AKR7A2 TGAAAGGTGGGGGGTGAGTC CCACTTGAGCGCTTCCTGTT
540 TG11H11 201594_s_at PPP4R1 TCTTCACATACTGTACATAC CTGTGACCACTCTTGGGAGT
541 PA6B6 217933_s_at LAP3 ACCAACAAAGATGAAGTTCC CTATCTACGGAAAGGCATGA
542 UG3H3 202868_s_at POP4 AGCCAATTCCATTTATAGAC CACCTCCAGCCAGTGACGCT
543 IIA4B4 202949_s_at FHL2 CCAGGCAATCTTGCCTTCTG GTTTCTTCCAGCCACATTGA
544 UC9D9 209341_s_at IKBKB TTTGTTGGAGAAGAAAGTTG GAGTAGGAGACTTTCACAAG
545 ZZG4H4 201811_x_at SH3BP5 GATTTATTCTAAGAGAAGTG CATGTGAAGAATGGTTGCCA
546 YYC9D9 204143_s_at ENOSF1 ACCGATCAAGATGAGTTCAG CTAGAAGTCATACCACCCTC
547 UUE7F7 217931_at CNPY3 AAACTCACCATCCCTCAGTC CTCCCCAACAGGGTACTAGG
548 MG10H10 209100_at IFRD2 GGAGACTTTCTATGCCCTTG GTCCGTATTTTTAACAGAAG
549 ZA3B3 201466_s_at JUN TGCGATGTTTCAGGAGGCTG GAGGAAGGGGGGTTGCAGTG
550 VE5F5 202830_s_at SLC37A4 GGCCATCATTCTCACTGTAC CACTAGGCGCAGTTGGATAT
551 ZZC8D8 218910_at ANO10 TGAGTGAGCCACCAGCTCTC CACGTTCCCCTCATAGCAGT
552 SSA12B12 203530_s_at STX4 GACAGTTCTTCTGGGGTTGG CAGCTGCTCATTCATGATGG
553 LLE7F7 203562_at FEZ1 GCGGGGTCCTTTGCCGTTGG CTTCTAGTGCTAGTAATCAT
554 NE11F11 209364_at BAD GGCGGAAGTACTTCCCTCAG GCCTATGCAAAAAGAGGATC
555 PPG9H9 203405_at PSMG1 TTGTCCATTGCTAGAACAAC CGAATATAGTACACGACCTT
556 JJE2F2 203885_at RAB21 GTTCAGTGGTATGAGCAGAG GAAGAGATCCCAGATAGTAG
557 NNE6F6 219170_at FSD1 AAGCGAGGCAGTGCTACCAG CAGCTCCAACACCAGCCTCA
558 UE8F8 207939_x_at RNPS1 CGTTCATGGTGGTCTTTCAG GTTATCTTGGCAACATGTAC
559 MMMG5H5 221492_s_at ATG3 GTGATGAAGAAAATCATTGA GACTGTTGCAGAAGGAGGGG
560 HHC3D3 210719_s_at HMG20B GACCCTGGTGGGGGTGGCTC CTTCTCACTGCTGGATCCGG
561 HHE8F8 204605_at CGRRF1 AGAATGGGACTGTGAACTGG GTACTCTTACCATGCAGACA
562 PPC2D2 218450_at HEBP1 ATAGACCAGAAAAATCCTGG CAGCTTTTCTCCAGGCATCT
563 ZG2H2 212049_at WIPF2 TCTCAGTCCCTGGCCATGTG GTCAAGGTGGCTTTCTGTTA
564 PPC11D11 203848_at AKAP8 GCCCTGCTGTGTCAGTTTCC CTGTGGCCTTTTGAACTGTA
565 NNA2B2 204587_at SLC25A14 ACTTGGGCTAGAGCAGAAGG CATAGGCCAGGGTGGTTATT
566 BBBE8F8 204418_x_at GSTM2 TCTCCCGATTTGAGGGCTTG GAGAAGATCTCTGCCTACAT
567 YC1D1 203047_at STK10 TTCTCTTCAGGAAGAAAAAG CATCAGGGGGAAATGGAATG
568 IIC2D2 205451_at FOXO4 GTGTCAGCGCCTGGCCTACC CAGATTGTATCATGTGCTAG
569 PE11F11 203346_s_at MTF2 ACGTCGGGTGACACTTGATG GAAAGGTGCAGTATCTTGTG
570 OOE6F6 218571_s_at CHMP4A GGCTCCCTTCTCTTTGATAG CAGTTATAATGCCCTTGTTC
571 RG9H9 203241_at UVRAG GGTGTCTGGTAGGCAAACTG CAAGGCAGTTGAGATAGTTG
572 OOG11H11 201695_s_at NP GATGCCCAGGATTTGACTCG GGCCTTAGAACTTTGCATAG
573 RE8F8 203764_at DLGAP5 TTTCCTTCATATTATCAATG CTTATATATTCCTTAGACTA
574 NNG10H10 201631_s_at IER3 CTTTGTGGGACTGGTGGAAG CAGGACACCTGGAACTGCGG
575 SSG5H5 214221_at ALMS1 GGTGATTAAAATTCCTAATG GTTTGGGAGCAATACTTTCT
576 JJG12H12 219742_at PRR7 GCTTGGCGTCTGCCGGTCTC CATCCCCTTGTTCGGGAGGA
577 LE12F12 202016_at MEST TGATTCCTTTATGATGACTG CTTAACTCCCCACTGCCTGT
578 WA11B11 202108_at PEPD GCTTCGGCATTTGATCAGAC CAAACAGTGCTGTTTCCCGG
579 MMA8B8 201074_at SMARCC1 GGAGTCCGAGAAGGAAAATG GAATTCTGGTTCATACTGTG
580 PE6F6 202780_at OXCT1 CCACATGGTTAAATGCATAC CTTCCCAGTACTGGGGGGAA
581 HHHG11H11 209253_at SORBS3 CTAGCCTGGCTCAAATATTC CCCAGGGAGACTGCTGTGTG
582 NE6F6 203256_at CDH3 TACAGTGGACTTTCTCTCTG GAATGGAACCTTCTTAGGCC
583 PC8D8 208398_s_at TBPL1 AGCAGAGCTGTCACAGTGTG CACTACCTTAGATTGTTTTA
584 OOE10F10 201519_at TOMM70A TCTCCCTTCTTTCATCTTGG GGTTGGGTAGAGAAACACAA
585 LA10B10 217745_s_at NAT13 ACTATGTTAGTTGCATTTAG GTTTTAAAGCAAAGAATCTG
586 ZA11B11 210811_s_at DDX49 AGGAGATCAACAAACGGAAG CAGCTGATCCTGGAGGGGAA
587 NNC11D11 201887_at IL13RA1 GGTCTTGGGAGCTCTTGGAG GTGTCTGTATCAGTGGATTT
588 PPG3H3 202447_at DECR1 ACCAAGGAGCAGTGGGACAC CATAGAAGAACTCATCAGGA
589 SC12D12 202749_at WRB GAAATGTTTAGGGACATCTC CATGCTGTCACTTGTGATTT
590 IIE6F6 204285_s_at PMAIP1 CCGCTGGCCTACTGTGAAGG GAGATGACCTGTGATTAGAC
591 KKA10B10 201036_s_at HADH GAATGGGTCAGCATATCTCT GTTTGCATGGTTTGCAGGAG
592 NNE3F3 207877_s_at NVL CGGCAGAGAATCCCCCACAC GCTCTGAAGGACCCACTTTC
593 RRG7H7 203806_s_at FANCA GGAACCCACAGACCTCACAC CTGGGGGACAGAGGCAGATA
594 RRG12H12 201819_at SCARB1 CACTGCATCGGGTTGTCTGG CGCCCTTTTCCTCCAGCCTA
595 OG12H12 201709_s_at NIPSNAP1 CTGTTCCCTCACCCTGTATC CTGTCTCCCCTAATTGACAT
596 OOC7D7 221741_s_at YTHDF1 TGAGTTGAAGCATGAAAATG GTGCCCATGCCTGACGCTCC
597 KKE10F10 202916_s_at FAM20B CAATTCCTCAAGTCTGGGTG GTGACAAGGTAGGGGCTAGG
598 SG4H4 202148_s_at PYCR1 GGTTTCCAGCCCCCAGTGTC CTGACTTCTGTCTGCCACAT
599 LC1D1 218316_at TIMM9 CAGTAGCCACCATGTTCAAC CATCTGTCATGACTGTTTGG
600 QQC10D10 212894_at SUPV3L1 CCAGCCCCGATGCAGGAGAG CTGTCCCTTGCTTCCAGATT
601 QQA12B12 215903_s_at MAST2 GCCAAGAACCAGGGGGCCAT CAAAAGCATCGGGATTTGGC
602 PPG8H8 203285_s_at HS2ST1 TGCAGTGGCTGAACAAAGAG CATGGCTTGAGAATCAAAGG
603 SE4F4 203594_at RTCD1 AAACAGGACCAGTTACACTC CATACGCAAACCGCGATACA
604 UUA2B2 219384_s_at ADAT1 TACTACCTAGAGAAAGCCAG CAAAGAATGAAGGCAACAAA
605 SSE9F9 201825_s_at SCCPDH ATTGATGCTGCCTCATTCAC GCTGACATTCTTTGGTCAAG
606 RC5D5 204168_at MGST2 CCTAGGTGCCCTGGGAATTG CAAACAGCTTTCTGGATGAA
607 AAAA6B6 221227_x_at COQ3 AGAAACAGAAGAGCTCCAAG CTAATGCCTGCACCAATCCA
608 UUC2D2 219390_at FKBP14 TAGGACTTAAGCTGATGAAG CTTGGCTCCTAGTGATTGGT
609 YG8H8 202184_s_at NUP133 AGTTCTTGTCCTGGTTCTAG CTGCTCACATGTACAAATCA
610 VE2F2 202521_at CTCF ATATGTAATGGGGTTGAAAG CTGGGGAGGAGGATCTACTG
611 MMMC2D2 209215_at MFSD10 TCAGTGACTCCGAGCTGCAG CACTCCAAGGCTGTCAGGGC
612 OOE8F8 201174_s_at TERF2IP CCTTCTCAGTCAAGTCTGCC GGATGTCTTTCTTTACCTAC
613 PG1H1 217758_s_at TM9SF3 ATCTGTTCAGGTTGGTGTAC CGTGTAAAGTGGGGATGGGG
614 LLC7D7 212453_at KIAA1279 CCTTGTAAGAAAAAATGCTG GGTAATGTACCTGGTAACAA
615 1NNE9F9 218435_at DNAJC15 CAAGGCTAAGATTAGAACAG CTCATAGGAGAGTCATGATT
616 TTC11D11 209911_x_at HIST1H2BD CCACCCAAATCCAACTCATC CTGGTTTGCTGCACACTGGT
617 BBBE10F10 212115_at HN1L GGGAGAAGAAGAGTTCCTGC GCATGCAAGCCCTGCTGTGT
618 KKA7B7 217995_at SQRDL GCTAAGGGGTTACTGGGGAG GACCAGCGTTTCTGCGCAAG
619 LLLC5D5 210058_at MAPK13 CCTTCCTTGGCTCTTTTTAG CTTGTGGCGGCAGTGGGCAG
620 IIC5D5 218642_s_at CHCHD7 TTGCAGGATGAGTTGGGCAG GGAAAAGGGTCAGGGTTCAT
621 ZC5D5 204000_at GNB5 GCCCAGCCCTTCTTCTAGTG GTAGCTCTGGCTTTGCAGGC
622 MMA4B4 208249_s_at TGDS
TGATTCGGACAACCATGAGG GGTAGTGGTGCTAGGGAGAA  623FFFC8D8 218068_s_at ZNF672 AGGCCAAAACCATGTGGGTG CACAAAGCCAGGCACTGCCA 624 AAAC10D10 217901_at DSG2 CAAAGGATTTATATAGTGTG CTCCCACTAACTGTACAGAT 625 YYA6B6 213419_at APBB2 GAACTAACGCTGCGTCCTTG GAATGAATGATGCGTGAGTT 626 MC2D2 202683_s_at RNMT ATTCCCTTCCAGTTAACTAC CTCTCCAAGGGAAACCACTA 627 PPA10B10 203456_at PRAF2 TGCCCCTCACCCCAATGTTC CACACCATCGACAACCAAGG 628 PG5H5 201266_at TXNRD1 TCACGTCCTCATCTCATTTG GCTGTGTAAAGAAATGGGAA 629 SSG1H1 202261_at VPS72 GAAGTACATTACTGCCCATG GACTGCCGCCCACTGCCTCA 630 QQE8F8 209460_at ABAT CAGCAGAAGCTGGTAAAAAC ATGGGGAGCCCGGAGGACAG 631 RC9D9 213390_at ZC3H4 TGTGGATGAAATAGAAGCTG GAGCCCTCCTCTTGGAATAT 632 HHHG4H4 205036_at LSM6 ATCAGTACACAGAAGAGACG GATGTGAAGACACCAAGAGA 633 JJE4F4 204937_s_at ZNF274 GCCTTTTCAGCTTGACCCTG CAATATAACATGCACAGGCC 634 MMMA4B4 212624_s_at CHN1 TGCGTCCTGGGTAGTCTGTG CTTGTAATCCAGCATGTTTC 635 SE9F9 218350_s_at GMNN CCTCCACTAGTTCTTTGTAG CAGAGTACATAACTACATAA 636 JJA3B3 204484_at PIK3C2B ATAACTGGAGAAAGAAGCTC CATTGACCGAAGCCACAGGG 637 PPC1D1 202230_s_at CHERP AATCGGCCACACCTGGTGTC CATGGGCAGCCTGGTGCAAT 638 QQE1F1 204617_s_at ACD CCTTCCAGTATGAGTATGAG CCACCCTGCACGTCCCTCTG 639 KKE6F6 202761_s_at SYNE2 TTGAGCTGCCGGTTATACAC CAAAATGTTCTGTTCAGTAC 640 MMC10D10 202756_s_at GPC1 TCAGGAGCCCCCAACACAGG CAAGTCCACCCCATAATAAC 641 RRA10B10 204808_s_at TMEM5 TTGCTCCTATGGCTCCATTCCTGTGGTGGAAGACGTGATG  642 JJE6F6 205450_at PHKA1 CCTAATCACTCCAACCCTGCCCCTTTCTGTCCCATCCTTC  643 XG10H10 201875_s_at MPZL1 CTTTCCTGGTTGCAGATAACGAACTAAGGTTGCCTAAAGG  644 KKKA12B12 221482_s_at ARPP19GAAAGATTTGTATCTCTGTG CTTGAACTTGAATGGCCTTA  645 KKA11B11 202598_atS100A13 AAATCAGGAAGAAGAAAGAC CTGAAGATCAGGAAGAAGTA  646 JJG11H11218215_s_at NR1H2 CTTGCCTGACCACCCTCCAG CAGATAGACGCCGGCACCCC  647 XG6H6202689_at RBM15B CACTAAGGACATTGGGCAAG CTAGAAGAAGAACACATGGT  648 OOC3D3218050_at UFM1 CCCCGTTTCTTACAATAAAT GTTGAGTCTTAGTTAAGCAG  649 IIIC6D6205963_s_at DNAJA3 TGGTAGCATGTCGCAGTTTC CATGTGTTTCAGGATCTTCG  650IIIA5B5 201561_s_at CLSTN1 CCCTGACTGCTAGTTCTGAG 
GACACTGGTGGCTGTGCTAT 651 RC8D8 201899_s_at UBE2A GCTGACTGGGCACACTCATG CCAAGTTTCAGAATTATTGG 652 UUA7B7 219127_at ATAD4 CAAGTCACACACCCTCAAAG GGAAGCTACACGGGCCAAAT 653 MMC11D11 202811_at STAMBP GGGTGAGGGACAGCTTACTC CATTTGACCAGATTGTTTGG 654 ZZG6H6 208847_s_at ADH5 ATCCTGTCGTGATGTGATAG GAGCAGCTTAACAGGCAGGG 655 NNG4H4 212485_at GPATCH8 CAAACACAACTCTTGACTGC CCTCCCACCCTCCTACCTGT 656 RRA4B4 218852_at PPP2R3C GCTTCTGGACTTACGAGAAC AGAGAGGCTCTTGTTGCAAA 657 MA12B12 221732_at CANT1 GTGGCTGAATTGAGACCTTG CTGATGTATTCATGTCAGCA 658 UUE6F6 218780_at HOOK2 CCTGGCATCTCTGAACCTTC GCCCCACTGACAAGCACTGA 659 HHG12H12 217870_s_at CMPK1 TCATCAGGTATCTTTCTGTGGCATTTGAGAACAGAAACCA  660 HHA8B8 203709_at PHKG2 TGAAGAGGAGGGAGACTCTGCTGCTATAACTGAGGATGAG  661 JJG9H9 209724_s_at ZFP161 GGGGCAGTACCAGTCCATACCAGCTGCGATTTGTGAGTGG  662 ZZA3B3 202889_x_at MAP7 ACTTCCATGTACAACAAACGCTCCGGGAAATGGAAAGCCA  663 TTA11B11 218809_at PANK2 CAGTTGACTGGTTTTGTGTCCTGTTTGAACTTGCTGAATG  664 LG11H11 201489_at PPIF CAATGTGAATTCCTGTGTTGCTAACAGAAGTGGCCTGTAA  665 IIC10D10 201767_s_at ELAC2CCCTGCACACCAGAGACAAG CAGAGTAACAGGATCAGTGG  666 LLC10D10 212070_at GPR56TTGCTGGCCTGTTGTAGGTG GTAGGGACACAGATGACCGA  667 NNNA9B9 200929_at TMED10CTAAGGCATCCTACCAACAG CACCATCAAGGCACGTTGGA  668 AAAC2D2 220094_s_atCCDC90A GAAATAGTGGCATTGCATGC CCAGCAAGATCGGGCCCTTA  669 OOA5B5 212833_atSLC25A46 TCAGAGACAACATCCTTGTC CATATCCAAACCCAGTGTTT  670 YE2F2 202371_atTCEAL4 CTTTTGACCTATCTGCAATG CAGTGTTCTCAGTAGGAAAT  671 RRG1H1 218249_atZDHHC6 CTGGTTAAGATGTTCTTTTC CTCAAAGGTGCCCTAGTGCC  672 PPE9F9 203395_s_atHES1 TCCCTCCGGACTCTAAACAG GAACTTGAATACTGGGAGAG  673 IIE4F4 205562_atRPP38 GGCTCAGTGAGAGAATCGCC CCCGTCATTGGCTTAAAATG  674 QQG5H5 205750_atBPHL GGTGGTTCCTTCGTGTGGGG CTTGATCGTGTTGCTGCCTG  675 JJC11D11 212871_atMAPKAPK5 GTGATAGAAGAGCAAACCAC GTCCCACGAATCCCAATAAT  676 HHC6D6 201620_atMBTPS1 TCTTCTGACTGCAGGGGAAG GATGTACTTTCCAAACAAAT  677 UC7D7 202996_atPOLD4 GAGGCACCACGTAAGACCTC CTGCCCTTAGCTCTCTTGCT  678 IIIG12H12 218826_atSLC35F2 CAAAGAGTATGCCTGGGAGC CTCCAGCTGTTAAAAGACAA  
679 RC10D10202626_s_at LYN GGGATCATCTGCCGTGCCTG GATCCTGAAATAGAGGCTAA  680 GGE5F5218397_at FANCL TCTTGGTATAAATACACTTC CACAGTCAGCACGGGGATCA  681 HHC2D2201548_s_at KDM5B TCAGCAAAGCTACAGGACTG GTACTCAAGCCAGCCTGTAA  682 YE5F5213689_x_at FAM69A CACACGTATACTCAGATTTG GCATGTACCTTTCAACATCT  683 VG8H8201223_s_at RAD23B CCCCTTCCCTCAGCAGAAAC GTGTTTATCAGCAAGTCGTG  684BBBC12D12 203627_at IGF1R AAGCAGTCAATGGATTCAAG CATTCTAAGCTTTGTTGACA  685MMMG1H1 217867_x_at BACE2 TATTAAGAAAATCACATTTC CAGGGCAGCAGCCGGGATCG  686UG2H2 204952_at LYPD3 CTTCTCATCCTTGTCTCTCC GCTTGTCCTCTTGTGATGTT  687KKG7H7 221449_s_at ITFG1 GGAAAAGAAAGCAGATGATA GAGAAAAACGACAAGAAGCC  688MMA12B12 203124_s_at SLC11A2 TTGGCTCCCTTGAGGTTCTG CTAGTGGTGTTAGGAGTGGT 689 EEE10F10 202362_at RAP1A AATATGATTATACAAAAGAG CATGGATGCATTTCAAATGT 690 MMME7F7 212449_s_at LYPLA1 TAATAAAGGCTAGTCAGAACCCTATACCATAAAGTGTAGT  691 VVC12D12 209015_s_at DNAJB6GCCGTTCATGTTGCTTTCTC CTTTGTCCTCTTGGACTTGA  692 MMC4D4 209662_at CETN3ATGGAGAAATAAACCAAGAG GAGTTCATTGCTATTATGAC  693 CC5D5 200618_at LASP1GGGGTTGTTGTCTCATTTTG GTCTGTTTTGGTCCCCTCCC  694 DDA9B9 217971_at MAPKSP1TACATTGATCCACTTGAGCC GTTAAGTGCTGCCAATTGTA  695 LE9F9 218595_s_at HEATR1AGTGCCAAAAGACTATTCAG CAACTGGAAACTGTCCTGGG  696 KKA9B9 201735_s_at CLCN3GTCTCGAAGGAAGCGAGAAC GAAATCTCTCATTGTGTGCC  697 QQC11D11 213531_s_atRAB3GAP1 GGAGCTCAAGATGTCTTGTG TCTGTGTGGCTAGATGGCCT  698 SSG11H11203447_at PSMD5 AAATTATTTTAAAGTGACTG GAATTATCTAGTCCCCAGAT  699 HHHC6D6212345_s_at CREB3L2 GGTTTTAGCTCTGTTCTCTG CTCCCATCCTTCGCTCACCA  700JJG8H8 209179_s_at MBOAT7 CCCTGGGCAGTGGGTTTTGG GCAAATTCCCTTTCTTTGCA  701JJE3F3 202093_s_at PAF1 GTGATGCTGATTCTGAGGAC GATGCCGACTCTGATGATGA  702UUG3H3 219363_s_at MTERFD1 TTTGTGCACAATGTGATGAG CATTCCCCACCACATCATTG 703 WWA8B8 203094_at MAD2L1BP GATTTCCTGATAGGCTGATG GCATGTGGCTGTGACTGTGA 704 MC3D3 202458_at PRSS23 TGACACAGTGTTCCCTCCTG GCAGCAATTAAGGGTCTTCA 705 HHA10B10 202708_s_at HIST2H2BE AGTGATTCAGCTGTTTTTGGCTAAGGGCTTTTGGAGCTGA  706 OE5F5 202847_at PCK2 AGTCTAGCAAGAGGACATAGCACCCTCATCTGGGAATAGG  
707 IIIG3H3 201331_s_at STAT6 GCTGCATCTTTTCTGTTGCCCCATCCACCGCCAGCTTCCC  708 MMA6B6 218961_s_at PNKP AAGGCTTCTCTGCCATCCTGGAGATCCCGTTCCGGCTATG  709 TTA2B2 211015_s_at HSPA4 GGCAGATAGACAGAGAGATGCTCAACTTGTACATTGAAAA  710 QE1F1 212231_at FBXO21 CTCCAGGAAGCCTGTATCACCTGTGTAAGTTGGTATTTGG  711 TTE11F11 215497_s_at WDTC1CCGAGCCTTTTTGTTGCTCC GCTCCCAGGAGAGTGAGGGT  712 RRC4D4 219016_at FASTKD5CTCGGCTTGGCTACCGTGTG GTAGAGTTATCCTACTGGGA  713 LLA2B2 218542_at CEP55TGTTCCCCAACTCTGTTCTG CGCACGAAACAGTATCTGTT  714 OOG5H5 218358_at CRELD2GATGTCCCGTGGAAAATGTG GCCCTGAGGATGCCGTCTCC  715 SC11D11 209586_s_at PRUNECCTACCCCACAGCTCTGTTC CATGTAAGTTGCCAACAGTT  716 FFFC1D1 218113_at TMEM2ATGGCCTCTACCTTTGTATC CAGGAGAAACTGCAGAGCAG  717 TTA8B8 220661_s_at ZNF692ACTGGGCTGTAGGGGAGCTG GACTACTTTAGTCTTCCTAA  718 VE6F6 209394_at ASMTLCATGCTGGTGCAGACTGAAG GCAAGGAGCGGAGCCTGGGC  719 PE7F7 202109_at ARFIP2TTGCTGCCCTGTCTATCTTC CTGGCCACAGGGCTTCATTC  720 MG5H5 202528_at GALEAGGCTCTGGCACAAAACCTC CTCCTCCCAGGCACTCATTT  721 LLG11H11 201870_at TOMM34GTTTTTTGTTCCAACAGTGG CCTTCTCCGGGCTTCATAGT  722 BBBC7D7 210473_s_atGPR125 GGACCAATTAAAAGCAATGG GCAGGAGGGACCCTTGCTCG  723 IIIC9D9218744_s_at PACSIN3 GGCTGAGGGCAAGATGGGAG GTCAGAGGTGACAGAAGCGT  724 WC1D1  1053_at RFC2 TACAGGTGCCCTATTCTGAG GTACAGGAGCCGCGGCTTTC  725 JJE11F11217809_at BZW2 ATGGAGCCCTGAGGCATCAG CTATTATACTTGGGACTCTA  726 TTE8F8219270_at CHAC1 ACAGGCCCTGGCAACCTTCC CAGTCTGTCCCATACTGTTA  727 KKE7F7219082_at AMDHD2 TCGACGACTCCCTTCACGTC CAGGCCACCTACATCTCGGG  728 YG7H7201968_s_at PGM1 CATGCCCTCCTGCATTGCTG CTGCGTGGGTATTTGTCTCC  729 SE3F3202722_s_at GFPT1 GCAGTGTATGCTCATACTTG GACAGTTAGGGAAGGGTTTG  730 QQG4H4205251_at PER2 CTCTCAGAGTTTCTGTGATG ATTTGTTGAGCCTTGCTGGA  731 UUG7H7201416_at SOX4 GCACGCTCTTTAAGAGTCTG CACTGGAGGAACTCCTGCCA  732 OOG10H10201531_at ZFP36 CTCAAATTACCCTCCAAAAG CAAGTAGCCAAAGCCGTTGC  733 JJJE5F5203336_s_at ITGB1BP1 CTGAAGACCACAGATGCAAG CAATGAGGAATACAGCCTGT  734FFFE1F1 212282_at TMEM97 CCATATTGGCCCGATTAGTG GTACTGTCTGACTCACGTGT  735KKA5B5 213995_at ATP5S 
TGTGCAAGTGTCATTATATC GAGGATGACTGTTTGCTGAG  736AAA4B4 213918_s_at NIPBL GGAGTCAACGTATTTCGCAG CGTATTACGTAAAATGATTT  737PPC7D7 202854_at HPRT1 ACTATGAGCCTATAGACTAT CAGTTCCCTTTGGGCGGATT  738KKA1B1 221549_at GRWD1 GAGGTGTGGGTTCCTCCAAC ACAATTTGCTTCTGCCCGTT  739LLLA3B3 202900_s_at NUP88 CCATTATTCTCAGTGCCTAC CAGCGAAAGTGCATTCAGTC  740NA2B2 201673_s_at GYS1 GCCCACTGTGAAACCACTAG GTTCTAGGTCCTGGCTTCTA  741RE5F5 217777_s_at PTPLAD1 AGGCTCAGCCCACCCCAACC CTATCTCATGTTCAGTCTGT  742MME1F1 200843_s_at EPRS TCAAACCACTCTGTGAACTG CAGCCTGGAGCCAAATGTGT  743RRE1F1 218175_at CCDC92 GGCACCGATCACCGAGCAGC CGTGCGTGTATCTCAAGGAA  744HHG11H11 204711_at KIAA0753 GGCTCAGTGAAGGAAACATG CAGAAAGAATGCCTGAGACG 745 NA11B11 218001_at MRPS2 TCAATCTAAATGCCTTTCAG GTGGGCCGCTTCCTTGGCTA 746 PPA11B11 203775_at SLC25A13 CAGACAGAAAAAACTGAGATGTAGCCCCTCTCCTGGAAGT  747 QQE6F6 205895_s_at NOLC1 GGGAACCCTCAGGTCTCTAGGTGAGGGTCTTGATGAGGAC  748 HHHG12H12 209262_s_at NR2F6TAGCATGAACTTGTGGGATG GTGGGGTTGGCTTCCCTGGC  749 IIIE8F8 218828_at PLSCR3CTGCCTTCAGCTGGTGCTTG CTGCGATTCCTGTGCCTTAT  750 AAAC11D11 203303_atDYNLT3 GAGCGGAACCATAACTCATT GAATTTTGGAGAGGAATAAG  751 TTE3F3 216913_s_atRRP12 CCTGGACTCAGGATGACTTG GAACTAGGGCTTGGCTCTCA  752 OOA11B11201572_x_at DCTD AGCTTACTGCAGCACTGTTG GTGTTCGGAGCTCTTCTGTG  753 MMA10B10202734_at TRIP10 GGACCTATGCACTTTATTTC TGACCCCGTGGCTTCGGCTG  754 OE1F1203258_at DRAP1 GAAGATTACGACTCCTAGCG CCTTCTGCCCCCCAGACCAT  755 GGA6B6217734_s_at WDR6 TTGTAGTAGGAGCTGAAATC CATGCTGAGCTGTACCAGGA  756 XC10D10203905_at PARN TTGAAACAGATCACAGCAAC GACAAACGCTCATGGCGCTG  757 ME7F7218577_at LRRC40 ATTGACTTGAATATGACTAG CCAGTTTCTATGTTTTTGTT  758 BBBG7H7209409_at GRB10 ACAGTATGACCGATCTCTGC GCCTTTCTGGGGGCGGGCAA  759 NG11H11201098_at COPB2 TCCTACTCCGGTTATTGTGG CCTCCCACACAGCCAACAAA  760 TTC3D3216321_s_at NR3C1 GTCCACCCAGGATTAGTGAC CAGGTTTTCAGGAAAGGATT  761 VA10B10201995_at EXT1 AGAAATACCGAGACATTGAG CGACTTTGAGGAATCCGGCT  762 JJC3D3204742_s_at PDS5B TGCTGCAGTGCAACAGGAGG CTTTTTCAGTGATCTTCACT  763 SSE2F2212180_at CRKL CAGGAGGAACAGTGGCCTTG 
CTTCTTAGACGGTCTTCACT  764 HHA3B3203171_s_at RRP8 ACAAGCGCAGGTGACCTCTG GATCTTCCTTGAAAGGGGAG  765 MMMC5D5209608_s_at ACAT2 CTTTGCAGCTGTCTCTGCTG CAATAGTTAAAGAACTTGGA  766 PPA8B8203046_s_at TIMELESS CCTTTGGCTTTCTCTTGGAG GTGGGTCGCAGCACCAGATG  767QG9H9 203341_at CEBPZ CAAACAGCTTAGATGGGAGG CTGAACGTGATGACTGGCTA  768OOA8B8 201153_s_at MBNL1 TCCAGCCTTCACTCCAGCTG GTTAAAAATGTTGCACTTAT  769NC6D6 207831_x_at DHPS AAACCTTTGCCCAGAAGATG GATGCCTTCATGCATGAGAA  770IIE10F10 201778_s_at KIAA0494 GTCACAGTTGAGGATTTTGG CTGTGATGGGCTCATACTCA 771 JJA10B10 210151_s_at DYRK3 GTATTGCCAAAACTGATTAGCTAGTGGACAGAGATATGCC  772 OOG6H6 218743_at CHMP6 GTTATGAGACGATCTCGCTGGGACCGCCCCTGCCCGTGGA  773 IIC8D8 200791_s_at IQGAP1 AAGGCCACATCCAAGACAGGCAATAATGAGCAGAGTTTAC  774 IIG1H1 205055_at ITGAE CTTGGAGAGCATCAGGAAGGCCCAGCTGAAATCAGAGAAT  775 MMMG4H4 201503_at G3BP1 AAGAAGGAATGTTACTTTAATATTGGACTTTGCTCATGTG  776 HHC5D5 217900_at IARS2 GTCTTCAGATACACTGTGTCCTCGATGTGCAGAAGTTGTC  777 JJE7F7 206015_s_at FOXJ3 TTTTGTGCAGATACAACCTGCTCTCTGTACTGCTGTTGGA  778 KKKA5B5 210153_s_at ME2 CCAGTGAAACTTACAGATGGGCGAGTCTTTACACCAGGTC  779 NNNA7B7 203328_x_at IDE GGAAATGTTGGCAGTAGATGCTCCAAGGAGACATAAGGTA  780 RRC2D2 218474_s_at KCTD5 GCATCCTCTCTGGGGAGCTGCTGGCCGCTTAGCGTTGTTT  781 ZZC4D4 202429_s_at PPP3CA ACCCAAACAAAGATGTTCTCGATACAGTCTGGCAAAGACT  782 RRA9B9 203911_at RAP1GAP TGGCCCCAATACCCATTTTGGAAGCCCCTGTGGCCGTGTG  783 LLLG7H7 215116_s_at DNM1 ACTACCAGAGAACGCTGTCCCCCGACATCCCACTCCAAAG  784 IIIG2H2 213844_at HOXA5 AACTCCCTTGTGTTCCTTCTGTGAAGAAGCCCTGTTCTCG  785 TTG11H11 218547_at DHDDS GCATCTCTCTTTGGCCTGAGGTTCTGTATTCTGGGAAAGG  786 TG5H5 203521_s_at ZNF318 ATTGAACTCATTCCCTGTTCCACAAACCCATATGTATCCT  787 TG7H7 213150_at HOXA10 CTAGGAGGACTGGGGTAAGCGGAATAAACTAGAGAAGGGA  788 TG9H9 203720_s_at ERCC1 GTACCTGGAGACCTACAAGGCCTATGAGCAGAAACCAGCG  789 NNA1B1 203546_at IPO13 AGAGGCGGGTGAAGGAGATGGTGAAGGAGTTCACACTGCT  790 IIG10H10 202388_at RGS2 TGCAGTGTCCGTTATGAGTGCCAAAAATCTGTCTTGAAGG  791 XC12D12 200617_at MLEC TTTCCCATCCTCTCTCTGTGGAGGCCAAACCAACTCTTTG  
792 OC7D7 213233_s_at KLHL9 ACCAAGGCAAAATGAATTGGCTTCTAGGGGTCTGAACCTT  793 SG12H12 212997_s_at TLK2 TCCGTCTGGTCTCCTGTTTGCAATTGCTTCCCTCATCTCA  794 JJA11B11 212689_s_at KDM3AGGCTGTAAAAGCAAAACCTC GTATCAGCTCTGGAACAATA  795 HHC9D9 212189_s_at COG4CAGCAGAGAAACAAAGTCTG GACCCACTCCATGCTCTGCC  796 OOC1D1 202911_at MSH6TAGGACATATGGCATGCATG GTAGAAAATGAATGTGAAGA  797 NNNE3F3 200698_at KDELR2ACAAAAGCTCTGTAGGGCTG CAGACATTTAAAGTTCACAT  798 VVG7H7 201913_s_at COASYGTCCAAGCTATACTGTGCAG GACATGGCCAGGCCTGGTGG  799 SE10F10 202604_x_atADAM10 GCTCGACCACCTCAACATTG GAGACATCACTTGCCAATGT  800 MMA1B1 202910_s_atCD97 TGTCCCATCCTGGACTTTTC CTCTCATGTCTTTGCTGCAG  801 VG9H9 205051_s_atKIT TCTATGCTCTCGCACCTTTC CAAAGTTAACAGATTTTGGG  802 LLLA6B6 202772_atHMGCL GCTGGCAGAGGCCATTTGTG GAAAGTGGAGAGCTACGTGG  803 KKC3D3 218667_atPJA1 GTTCCCTCCCCCACTCTAAA GACCAAGGCCGTTTACTCCT  804 CCCG3H3 203726_s_atLAMA3 GGTGGCAGTCACCATAAAAC AACACATCCTGCACCTGGAA  805 KKKA10B10217960_s_at TOMM22 CGGAGAAGTTGCAAATGGAG CAACAGCAGCAACTGCAGCA  806 RRE3F3218755_at KIF20A TCCTACGCTCACGGCGTTCC CCTTTACTCAAATCTGGGCC  807 RRE4F4219069_at ANKRD49 GATAGTCCTACCTCACCCTG GTCAACCTACATGATCCTTA  808 OOA1B1202880_s_at CYTH1 TTTCCTAGACAGAGAGGCAC CTGGGTCAGTATTAGTCTAT  809 RE4F4200825_s_at HYOU1 AGCTAGGGCTGCTGCCTCAG CTCCAAGACAAGAATGAACC  810 LA4B4214061_at WDR67 TCTTTTGGCTGCATAGAATG CATGTCACCTTGAGACGGTC  811 SE7F7204772_s_at TTF1 CACTAAAATCCAGACTCCTG CAGCACCCAAGCAAGTTTTC  812 NNA9B9201178_at FBXO7 GTGGTATGACCCAAAGGTTC CTCTGTGACAAGGTTGGCCT  813 LLC6D6204611_s_at PPP2R5B GTCTATTTATTCTCGCCCAG CTCACCCTCTACACAGACAC  814 ZG3H3202500_at DNAJB2 ACCCTGCTGCCCATTCTTTC CAACATCACAGATGAACTGC  815 YYC11D11201347_x_at GRHPR GTAGCCAAACAGTAGAGATG GAGGGCCGGGAAGCAAACCG  816 RA7B7214106_s_at GMDS TGGGTCGCTTTGCGTTTGTC GAAGCCTCCTCTGAATGGCT  817 ZZG7H7205640_at ALDH3B1 AAACCTACATTTGGACAATG AGAGGCTGCTCCTGCGGCCT  818 HHHA9B9205379_at CBR3 GACAGGATTCTGGTGAATGC GTGCTGCCCAGGACCAGTGA  819 OA12B12204662_at CP110 AGCTTATTCATAGCATTGTG GGTCTCTCCAGTAAGAAAGA  820 YA4B4202174_s_at PCM1 
AAGCTCTCTGGCTGGAAGTC CTGATACTGAATCTCCAGTG  821 JJJG7H7201351_s_at YME1L1 CAGAAACCCAATCTGCCATC GAACAAGAAATAAGAATCCT  822 ZE7F7202032_s_at MAN2A2 AGAAACTAGCCAAGGGCAAG CTATTATTCAGCAGTGTCCC  823AAAE10F10 205741_s_at DTNA CTGTCACCACAGAGATTGGC CTACGGTTTCTGTTTTGAGG 824 GGGA6B6 220091_at SLC2A6 GCCCAACCTCTGGGAACAGG CAGCTCCTATCTGCAAACTG 825 AAAE3F3 203213_at CDC2 AAGTCTTACAAAGATCAAGG GCTGTCCGCAACAGGGAAGA 826 BBBE12F12 205227_at IL1RAP CGTTCCATGCCCAGGTTAACAAAGAACTGTGATATATAGA  827 LA3B3 203566_s_at AGL TGCTTCATACTTGAGTGATGCTGGATAAGGTATTGTATTT  828 LC5D5 214741_at ZNF131 CGTTGAAACACATTGATTCCCCTCCCCCTACTTATTGCCA  829 YYA11B11 213343_s_at GDPD5AGCAGACCTCAAGGCAGAAG GGTCACCTAACCCAGGAGTC  830 LLLG6H6 210115_at RPL39LACTTGAAAAAGTGGTGTGTG GTTGACTCTGTTTCTCGCCA  831 LLC4D4 218104_at TEX10GAGGAGCTGCCTGTTGTGGG CCAGCTGCTTCGACTGCTGC  832 EEEE7F7 203127_s_atSPTLC2 AAAATTGGCGCCTTTGGACG GGAGATGCTGAAGCGGAACA  833 RRG9H9 203209_atRFC5 ACGCACTTGTTTTCATGCAG GAGCGGGGCAAGTAAGGTTG  834 IIA11B11 202441_atERLIN1 CCCTCTCAGCTCTGAGGCTG GCCGTCTTTCGGGGTGTTCC  835 KKA8B8 201011_atRPN1 AAACCAGGCCCTGCGTCAGG CAGTGTGAGTTTGCCGTTTG  836 BBBE7F7 219327_s_atGPRC5C ATGGGTGTCCCCACCCACTC CTCAGTGTTTGTGGAGTCGA  837 IIA7B7 205085_atORC1L GCCGTGTGTTCTCACCTGGG CTCCTGTCGCCTCCTGCTTG  838 VVC7D7 210416_s_atCHEK2 CTGTCTGAGGAAAATGAATC CACAGCTCTACCCCAGGTTC  839 LLG6H6 212830_atMEGF9 CCCTAGAAAGTAAGCCCAGG GCTTCAGATCTAAGTTAGTC  840 AAAA7B7 214074_s_atCTTN TGTGTTTTAAACAGAATTTC GTGAACAGCCTTTTATCTCC  841 KKC7D7 202908_atWFS1 CCTGCCAGTGTTTAGAAGAG CCTGACTGTGTTCAGTGCCT  842 HHE4F4 212968_atRFNG CGCTCTGACTTGTGGCTCAG GACTACTTTCTGGGTCGTGC  843 IIIE1F1 919665_atTIPARP CTGTTGTTTGCTGCCATTGG CATGAAATGGCCAACTGTGG  844 WWG11H11 208717_atOXA1L TTTTCCCTGGTCCAAGTATC CTGTCTCCGGATTCCAGCAG  845 LC4D4 203557_s_atPCBD1 TTTAGACCTTTTCCCTGCAC CACTCTCTTCATCCTGGGGG  846 AAE2F2 201579_atFAT1 AGTGTAACGGGGACCTTCTG CATACCTGTTTAGAACCAAA  847 SSA5B5 202006_atPTPN12 GTTTCTGAATTTTAAACTTG CTGGATTCATGCAGCCAGCT  848 OOE4F4 211783_s_atMTA1 GTTTACTTTTTGGCTGGAGC 
GGAGATGAGGGGCCACCCCG  849 YA7B7 201260_s_atSYPL1 TTGTTTCCTGTCCTTTGTTG CTCATGCTGTTTAAGTGCAG  850 QQC1D1 215884_s_atUBQLN2 GAAGGATCAGTGTAGTAATG CCAGGAAAGTGCTTTTTACC  851 IIIA2B2 203418_atCCNA2 CTCATGGACCTTCACCAGAC CTACCTCAAAGCACCACAGC  852 TTG12H12 221779_atMICALL1 GGAAGAGGCTCGCTCCCGCC CATGGTCATCACTGGTCTGT  853 JJJG3H3 203167_atTIMP2 AAGAAGAGCCTGAACCACAG GTACCAGATGGGCTGCGAGT  854 KKA12B12204998_s_at ATF5 AGTGTTTCGTGAAGGTGTTG GAGAGGGGCTGTGTCTGGGT  855 MA11B11217830_s_at NSFL1C CCCTGCAATGAGCCAAGAAC CAACACTACATCCACCTAGA  856 ZZA7B7217761_at ADI1 AATTCCGAGATAGGATTATG CCTAGTTTGTCATATCACAG  857 ZZE12F12218168_s_at CABC1 GAGCTGGGAGAGGTGCTGAG CTAACAGTGCCAACAAGTGC  858 MMC6D6219821_s_at GFOD1 AAAGTGAGCCTAGCCAGGAG GTGTTTGGGGCTCTATCGCG  859 IIIE3F3203648_at TATDN2 TGCAGGTGAAACCAACCAGC CCTGTGTTAGAGGAGGAAAA  860 MA3B3203250_at RBM16 GTCAAGGAAATGAATAACAG CTTGTCAGAGACTTCCTATG  861 RA2B2202040_s_at KDM5A AGCCCTGACCCCAATGTCTG CTGTTTCCAACACTGGTGAT  862ZZC12D12 211725_s_at BID CCTGGAGCAGCTGCTGCAGG CCTACCCTAGAGACATGGAG  863SG6H6 203208_s_at MTFR1 TTCCTGGCTGGGAGTATTAG GAGATGGGAGTAGAGATTCA  864TTG8H8 220140_s_at SNX11 AGACAATGAGGCATTCTGTC CTCCTGCTGCCATTCTTCAT  865UUA12B12 201080_at PIP4K2B ACAACTGTTCCCCAATCTAC CAGCCATCTGCAGGGGTCAG 866 NNG9H9 201250_s_at SLC2A1 GATTGAGGGTAGGAGGTTTG GATGGGAGTGAGACAGAAGT 867 PPG12H12 204126_s_at CDC45L CTGAAAGCTGAGGATCGGAGCAAGTTTCTGGACGCACTTA  868 MMA9B9 202220_at KIAA0907 TCTCCCAGAACTGGTTGCAGCTAAAACAGAGAGATCTGAC  869 SSE3F3 218742_at NARFL GAGCAAGACGGGTTCTCACCCCTGACTTCTGGAGGCTTCC  870 QA2B2 208424_s_at CIAPIN1 CCCACTTTAGAAGAGTCCAGGTTGGTGAGCATTTAGAGGG  871 SSC5D5 212644_s_at MAPK1IP1LTTAGGGAACCTTAAGTCATG CAGACATGACTGTTCTCTTT  872 YYA1B1 205480_s_at UGP2AGCGGGAATTTCCTACAGTG CCCTTGGTTAAATTAGGCAG  873 ZG1H1 203499_at EPHA2AGTCGGCCCCATCTCTCATC CTTTTGGATAAGTTTCTATT  874 GGGE7F7 204949_at ICAM3CATAATGGTACTTATCAGTG CCAAGCGTCCAGCTCACGAG  875 LLLG3H3 219654_at PTPLAGTGTGGTGCTTTTTCTGGTC GCGTGGACTGTGACAGAGAT  876 ZE6F6 215093_at NSDHLCACCCTACTCTTTCCGTGAC 
GATGAGGGCGGCAAAAACAG  877 QQE2F2 204826_at CCNFGGGTGAGAACCCAAGCGTTG GAACTGTAGACCCGTCCTGT  878 ZE12F12 201756_at RPA2GAGAAACCTGCTGGCCTCTG CCTGTTTTCATTTCCCACTT  879 OA2B2 202678_at GTF2A2AGGCTATAAATGCAGCACTG GCTCAGAGGGTCAGGAACAG  880 NC1D1 221230_s_at ARID4BTCTTTGTTTCCTGGCAATAC GACGTGGGAATTTCAATGCG  881 JJA1B1 203155_at SETDB1TGATCCCTTCCAATGTGGTG CTAGCAGGCAGGATCCCTTC  882 JJC10D10 212458_at SPRED2CCGACCCCCCAAGCTATTTG CTCACATTAACAAATTAAAG  883 OG8H8 213153_at SETD1BGAGTTTTAGGGATGTTTGTG CGGGTAGACTCCATCATCCA  884 LLLG2H2 208690_s_atPDLIM1 TGAGTCCCCTCCCTGCCTTG GTTAATTGACTCACACCAGC  885 SA8B8 218102_atDERA TGCCCTAGCAGAGGAAAATG CAACATCTCGCAAGCGCTGC  886 AAAC7D7 211919_s_atCXCR4 CCGACTTCATCTTTGCCAAC GTCAGTGAGGCAGATGACAG  887 TG4H4 203343_atUGDH TGCTGAGAATGTACAGTTTG CATTAAACATCCCAGGTCTC  888 QG5H5 203464_s_atEPN2 GCTGTTTCTCAGTCCCAGAG GCCGGTGGCTGGTTTTGAAC  889 QQG3H3 205173_x_atCD58 CCAAGCAGCGGTCATTCAAG ACACAGATATGCACTTATAC  890 YYE2F2 212399_s_atVGLL4 TGCCTGCAGTGCGCTCTGAC CTTCTCTTCATGTGTGTAAA  891 RRA7B7 221552_atABHD6 TGTTCTGAGTGAACCCACAG CAGTCGCAGAATGAGCACCT  892 NNE7F7 220127_s_atFBXL12 GGGCACCTGAGGGTCTGAGC CCCCTTATGAGTACCCAAGA  893 MMG5H5 217873_atCAB39 AGGTCGTAGCCTTTTAGGTG GAAGAAGTGAGGGTGCAGCG  894 QE2F2 203342_atTIMM17B CGAAGTTCTCACCCCAGCTC CTTTGTGTGGCACCCTGATG  895 PA12B12201697_s_at DNMT1 ACATGGTGTTTGTGGCCTTG GCTGACATGAAGCTGTTGTG  896 RRC5D5221887_s_at DFNB31 CCTCCAGCTAGGACCCAGCC CATCCCCAGATGCCTGAGCC  897OOC11D11 201608_s_at PWP1 AGTGGCCCTTTTGGCAGCAG GAGCTCAGATACACCCATGG  898QQG12H12 217168_s_at HERPUD1 GCTGTTGGAGGCTTTGACAG GAATGGACTGGATCACCTGA 899 MG2H2 201847_at LIPA GGTTGCCCATGAGAAGTGTC CTTGTTCATTTTCACCCAAA  900KKKC12D12 221641_s_at ACOT9 ACTCTACCCACAGTGACGTG GTATCTGATGAAGACCTGAT 901 LLC9D9 207871_s_at ST7 CTGTGGCACCAGCTAACACG GATCTGAGAGAAGCCCTGTC 902 YC6D6 208407_s_at CTNND1 ACCACTGGGCCATAATGTTG CTTCTCAGGCTATATGCAGT 903 LLC2D2 218581_at ABHD4 GGTGGTTCCCACTGCATGAC CCTCTATCCCTGCCATCTGT 904 AAE4F4 201626_at INSIG1 ATTTCCAATGAAGATGTCAG CATTTTATGAAAAACCAGAA 905 
QE10F10 203989_x_at ZNF160 GAAGAGAGAGGCCAGGCGCGGTGGCTCACACCTGTAATCC  906 NG6H6 202494_at PPIE TGGGCCTCTCCTGGGACTACCAGTGTGGCTCTTACGTGTT  907 ZA1B1 201628_s_at RRAGA AGTGGGCTTTGAAGTGTGTGCTGCTTACTCCTTTCATCTT  908 ME5F5 207467_x_at CAST CTCCAAAGCACCTAAGAATGGAGGTAAAGCGAAGGATTCA  909 IIA1B1 217911_s_at BAG3 TGCAGCCCTGTCTACTTGGGCACCCCCACCACCTGTTAGC  910 NNC8D8 201040_at GNAI2 TGTCTTGTTCTGTGATGAGGGGAGGGGGGCACATGCTGAG  911 MC5D5 203120_at TP53BP2 CCTGCCAGAAAGGACCAGTGCCGTCACATCGCTGTCTCTG  912 SC5D5 202825_at SLC25A4 AACCAGACTGAAAGGAATACCTCAGAAGAGATGCTTCATT  913 YG11H11 201644_at TSTA3 GGGCAGTTTAAGAAGACAGCCAGTAACAGCAAGCTGAGGA  914 VC8D8 202599_s_at NRIP1 TCCCATTGCAAACATTATTCCAAGAGTATCCCAGTATTAG  915 IIIE4F4 215945_s_at TRIM2 CGCTGTGCATCAAAGTGTTTGTATGTTCGTAGCTACATAC  916 NNC10D10 201397_at PHGDH GAGAAAATCCACATTCTTGGGCTGAACGCGGGCCTCTGAC  917 MMMC7D7 209163_at CYB561 CCAGTCTCCTCTAATGCTCAGATTTCCCATAGTTGGCTTT  918 RRG10H10 200895_s_at FKBP4GGACATGGGAAAAACCACTG CTATGCCATTTCTTCTCTCT  919 KKC2D2 200811_at CIRBPTGTGGCTTTTTTCCAACTCC GTGTGACGTTTCTGAGTGTA  920 QQG10H10 213110_s_atCOL4A5 GAATCCTCCTGTGGCCTCTG CTTGTACAGAACTGGGAAAC  921 SSE1F1 202009_atTWF2 CGGGCTGGCATTTTGTGACC CTTCCCTGTTGCTGTCCCTG  922 HHG7H7 202123_s_atABL1 CTGTGGTGGCTCCCCCTCTG CTTCTCGGGGTCCAGTGCAT  923 IIA10B10 201743_atCD14 CTGACGAGCTGCCCGAGGTG GATAACCTGACACTGGACGG  924 AAA8B8 203494_s_atCEP57 AAGTGAGAAACAGTGCTCTG GTGACATGATAAATATATGT  925 SSC6D6 221856_s_atFAM63A GTTTCTGGTTCTCAACTCCC OGTCCCTGAATAGTCACACG  926 UUA1B1 218695_atEXOSC4 GGCAGATGGTGGGACCTATG CAGCTTGTGTGAATGCAGCC  927 NNA10B10 201323_atEBNA1BP2 GAAAGGGTCAAATAAGAGAC CTGGAAAACGAACAAGAGAG  928 GGGE2F2203358_s_at EZH2 TCGAAAGAGAAATGGAAATC CCTTGACATCTGCTACCTCC  929 KKG12H12207515_s_at POLR1C AAGCTAAAGAAGGTTGTGAG GCTTGCCCGGGTTCGAGATC  930 PPC5D5202726_at LIG1 CCCTCGGTTTATTCGAGTCC GTGAAGACAAGCAGCCGGAG  931 LG1H1212875_s_at C2CD2 CGGAAAGGTTTGGCCTGACG CTGGAGTGCGGTGATGAACT  932 XG3H3218093_s_at ANKRD10 TGGATTTATTGTTTTTATTC CACACTTCCTACTTGGTCTC  933QC10D10 207059_at 
DDX41 CTGGCTGCCTGTTCCCTGTG CTCTTCAGAATTACTGTTTT  934KKG4H4 218421_at CERK AAGTCTGAGTGAAAGGATGG CCTCATTCTCTTTCTAATCT  935QC4D4 209380_s_at ABCC5 AGACCTACCTCAGGTTGCTG GTTGCTGTGTGGTTTGGTGT  936LLG9H9 202963_at RFX5 GTTCTGTGGTCAGGCGGCAC CAATGAGAAAGGAATGCAGA  937QE12F12 201944_at HEXB AGCTGCACAACCTCTTTATG CTGGATATTGTAACCATGAG  938ZA2B2 200915_x_at KTN1 TAAACCAACAGCTCACAAAG GAGAAAGAGCACTACCAGGT  939KKE5F5 212403_at UBE3B CCCATCCTAATTTTTATCAC CTGAAGGTTGGAACCAGTGA  940EEEG5H5 205398_s_at SMAD3 TCAAAGAGATTCGAATGACG GTAAGTGTTCTCATGAAGCA  941HHA2B2    121_at PAX8 TGTGCTTCCTGCAGCTCACG CCCACCAGCTACTGAAGGGA  942HHE3F3 212300_at TXLNA CAGCTTTTTTGTCTCCTTTG GGTATTCACAACAGCCAGGG  943NNE8F8 201087_at PXN TCTCCACTTTCACCCGCAGG CCTTACCGCTCTGTTTATAG  944LLLE8F8 201136_at PLP2 CAACAACATTCCCAGCAGAC CAACTCCCACCCCCTCTTTG  945HHHE5F5 212038_s_at VDAC1 TTCCCTAACCCTAATTGATG AGAGGCTCGCTGCTTGATGG  946XA11B11 909408_at KIF2C TTTAGTACAGCTATCTGCTG GCTCTAAACCTTCTACGCCT  947JJG2H2 204252_at CDK2 TGATCCCATTTTCCTCTGAC GTCCACCTCCTACCCCATAG  948RA12B12 204542_at ST6GALNAC2 TCCCATTAGAGATGTATCAC CACCTTGTCACCAACAGGAT 949 KKE2F2 202114_at SNX2 GACCCTCTTTGAATTAAGTG GACTGTGGCATGACATTCTG 950 FFG4H4 213152_s_at SFRS2B CATGCAGTGAGCACATCTAG CTGACGATAATCACACCTTT 951 LLE3F3 212651_at RHOBTB1 GGCAGTGGAAACACCAGATA GAAGATCTTAGGAGAGGCCC 952 YC8D8 202925_s_at PLAGL2 TAGCTGATTGTTCCCACTTG CACCTCTCCACCTTTGGCAC 953 TC10D10 205324_s_at FTSJ1 ACAACCCTGAAGACAACAAG GAAAGAAACCATGAAAGTCT 954 QQG7H7 208898_at ATP6V1D TCAGGCCAATTACTGTGGAG CAGCTTTCATTCCTACCCAC 955 LLE1F1 218399_s_at CDCA4 TAGATCACAGGCACCAGTTG GTCTTCAGGGACCTCATAGC 956 QQE3F3 205031_at EFNB3 TGGCCACCTCAATCACCAGC CAAGATGGTTGCTTTGTCCA 957 NNC3D3 205691_at SYNGR3 GACACCAGCCCTGTCCTAGC CCTTCAGTAAGACCTTGCCA 958 TTE10F10 221514_at UTP14A ATGCTCTGTAGATTGAGTTG CTGGAGGAGTGACAGCCAGG 959 MC12D12 203675_at NUCB2 AACGTCAGCATGATCAACTG GAGGCTCAGAAGCTGGAATA 960 TC7D7 202119_s_at CPNE3 GTAAATTCAGGGCCCCATTG CTACTTATGCCATATTTGGA 961 OOG7H7 200783_s_at STMN1 TTCTCTGCCCCGTTTCTTGC 
CCCAGTGTGGTTTGCATTGT 962 PPA3B3 202413_s_at USP1 CTTGATTCACTTAGAAGTGT CTCAGAAAACCTGGACAGTT 963 FFA8B8 218140_x_at SRPRB GGTCTAGTGTGTTCTTAGTG GTTATACTGGGAAGTGTGTG 964 MA7B7 219352_at HERC6 GGAATGTACTTTCACTTTTG CTGCTTCACTGCCTTGTGCT 965 QQA10B10 212880_at WDR7 CAACCAAGGCCAGTAGAAAG CTATGGCTGCAAAACCCTGG 966 ZZC7D7 200848_at AHCYL1 ATGAACTGAGATCATAAAGG GCAACTGATGTGTGAAGAAA 967 UUA6B6 212536_at ATP11B ACCTGAGACACTGTGGCTGT CTAATGTAATCCTTTAAAAA 968 TG10H10 204489_s_at CD44 GCCAACCTTTCCCCCACCAG CTAAGGACATTTCCCAGGGT 969 MMC1D1 204781_s_at FAS AGAAAGTAGCTTTGTGACAT GTCATGAACCCATGTTTGCA 970 JJE5F5 205079_s_at MPDZ ACCCCTAGCTCACCTCCTAC TGTAAAGAGAATGCACTGGT 971 QQA3B3 205046_at CENPE GGCAAGGATGTGCCTGAGTG CAAAACTCAGTAGACTCCTC 972 TG1H1 206414_s_at ASAP2 GCATTTTGCATGCCATTCTC CATCAGATCTGGGATGATGG 973 NNC2D2 204610_s_at CCDC85B CTAGCGCTTAAGGAGCTCTGCCTGGCGCTGGGCGAAGAAT  974 NNA3B3 204756_at MAP2K5 GGCCATCCCCATACCTTCTGGTTTGAAGGCGCTGACACTG  975 QC9D9 202318_s_at SENP6 GGACACTTACTCAACAGAAGCACCTTTAGGCGAAGGAACA  976 HHHA11B11 218407_x_at NENFTTCTTGGGAGCGTGAGGCAG GAAGACACTAGGTGCTGAAT  977 OOC5D5 213190_at COG7TTACTGACCCCACCACACAC CGGACCACCAAGAGAGCCAG  978 XG9H9 203576_at BCAT2GCCAGCACTCGCCTCCCTAC CAATGACTCACCTGAAGTGC  979 OE3F3 201827_at SMARCD2GTTTTCAGGGAGCCTGTTAG GTGCCTCCTTCTTTTCTTTC  980 IIG12H12 203067_at PDHXTGGCCATTAACTTAGCAGTG GGACCTCACTTTTACAAGCA  981 OOA6B6 221560_at MARK4AAAGAAGAGGCGTGGGAATC CAGGCAGTGGTTTTTCCTTT  982 UA5B5 212737_at GM2AGTGGCCTCGACATCAAACTG CCTGGATTTTTCTACCACCC  983 AAAG6H6 204925_at CTNSCCAGGACGTGCCTCATACAT GACTTGAGCTTGTCAGTCCA  984 OA11B11 212717_at PLEKHM1GTCTTTGCAATGTATTGAAG GAATTGCTGCCGTGTGAGTT  985 IIG8H8 201200_at CREG1TTCAGCCAGGGACAAAATCC CCTCCCAAACCACTCTCCAC  986 MA6B6 209603_at GATA3GCTACCAGCGTGCATGTCAG CGACCCTGGCCCGACAGGCC  987 CCCE3F3 219061_s_at LAGE3CTGGAAAGCTGAAGACTGTC GCCTGCTCCGAATTTCCGTC  988 CCCA2B2 204679_at KCNK1TAGGAGGAGAATACTTGAAG CAGTATGCTGCTGTGGTTAG  989 XC11D11 201931_at ETFAGCTTTGTTCCCAATGACATG CAAGTTGGACAGACGGGAAA  990 ZC1D1 202398_at 
AP3S2CACTGCTCAATACAGCCTCC GATCCTCACTCTTGAAAGCT  991 WE4F4 209307_at SWAP70TCACATGTGGACCTTGATAC GACTAAGCGGTTACATATGT  992 BBBA9B9 205919_at HBE1TGGCTACTCACTTTGGCAAG GAGTTCACCCCTGAAGTGCA  993 YYG8H8 208290_s_at EIF5TGGAGTGTGTGGTAGCAATG CATCAAGCTCAGCTTATCTC  994 NC4D4 218679_s_at VPS28CAACTCACTGTCTGCAGCTG CCTGTCTGGTGTCTGTCTTT  995 OOA12B12 201788_at DDX42GCTCTGAAGATTCCCAGAAG CCACAAGGATTGAAGGGAAA  996 ZC3D3 218149_s_at ZNF395GACGTCTGTGGCCAAGCGAG GTCTCAGGTGCAAAGCAAAA  997 BBBG3H3 211330_s_at HFETCGTCTGAAAGAGGAAGCAG CTATGAAGGCCAAAACAGAG  998 FFFG2H2 208763_s_atTSC22D3 AACCAGCCTTGGGAGTATTG ACTGGTCCCTTACCTCTTAT  999 TA3B3 203232_s_atATXN1 GCACTACCAGACTGACATGG CCAGTACAGAGGAGAACTAG 1000 TA9B9 202655_atARMET CTGGAGCTTTCCTGATGATG CTGGCCCTACAGTACCCCCA
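
The workflow recited in the numbered paragraphs of this section (cluster the transcripts of a first library, identify one centroid transcript per cluster, then determine how well centroid measurements infer the non-centroid transcripts of a second library) can be sketched as follows. This is a minimal illustration on synthetic data and not the procedure actually used: the k-means clustering with farthest-point initialisation, the least-squares inference model, the simulated libraries, and every function and variable name (`pick_centroids`, `fit_inference`, `infer`, `simulate`) are assumptions introduced here for illustration.

```python
import numpy as np

def zscore_rows(expr):
    """Standardize each transcript (row) across samples."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    return (expr - mu) / (sd + 1e-12)

def sq_dists(z, centers):
    """Squared Euclidean distance from every transcript to every cluster center."""
    return ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def pick_centroids(expr, n_clusters, n_iter=25):
    """Cluster transcripts with a basic k-means (assumed method) and return
    the index of the transcript closest to each cluster mean, plus labels."""
    z = zscore_rows(expr)
    # Deterministic farthest-point initialisation (illustrative choice).
    chosen = [0]
    while len(chosen) < n_clusters:
        d_min = sq_dists(z, z[chosen]).min(axis=1)
        chosen.append(int(d_min.argmax()))
    centers = z[chosen].copy()
    for _ in range(n_iter):
        labels = sq_dists(z, centers).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = z[labels == k].mean(axis=0)
    d = sq_dists(z, centers)
    labels = d.argmin(axis=1)
    centroids = [int(np.flatnonzero(labels == k)[d[labels == k, k].argmin()])
                 for k in range(n_clusters) if np.any(labels == k)]
    return np.array(sorted(centroids)), labels

def fit_inference(library, centroid_idx):
    """Least-squares weights mapping centroid levels (plus intercept) to all transcripts."""
    X = np.column_stack([library[centroid_idx].T, np.ones(library.shape[1])])
    W, *_ = np.linalg.lstsq(X, library.T, rcond=None)
    return W  # shape: (n_centroids + 1, n_transcripts)

def infer(measured, W):
    """Infer all transcripts (rows) from centroid measurements (centroids x samples)."""
    X = np.column_stack([measured.T, np.ones(measured.shape[1])])
    return (X @ W).T

# Synthetic stand-in for the two libraries: 30 transcripts in 3 co-regulated groups.
rng = np.random.default_rng(0)
group = np.repeat(np.arange(3), 10)
loading = rng.uniform(0.5, 2.0, size=30)

def simulate(n_samples):
    factors = rng.normal(size=(3, n_samples))
    noise = 0.1 * rng.normal(size=(30, n_samples))
    return loading[:, None] * factors[group] + noise

first_library = simulate(40)    # used for clustering and model fitting
second_library = simulate(20)   # used to test the ability to infer

centroid_idx, labels = pick_centroids(first_library, n_clusters=3)
W = fit_inference(first_library, centroid_idx)
inferred = infer(second_library[centroid_idx], W)

# Correlation between inferred and observed levels for non-centroid transcripts.
non_centroid = np.setdiff1d(np.arange(30), centroid_idx)
r = [np.corrcoef(inferred[i], second_library[i])[0, 1] for i in non_centroid]
print("centroid transcripts:", centroid_idx)
print("median correlation, non-centroid transcripts:", float(np.median(r)))
```

In this sketch a centroid transcript would be retained only if the non-centroid transcripts of its cluster are inferred with acceptable correlation in the second library; the acceptance threshold is left to the user and is not taken from this document.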

The invention is further described by the following numbered paragraphs:

-   1. A method for making a transcriptome-wide mRNA-expression profiling platform using sub-transcriptome numbers of transcript measurements comprising:
-   a) providing:
-   i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples;
-   ii) a second collection of biological samples;
-   iii) a second library of transcriptome-wide mRNA-expression data from said second collection of biological samples;
-   iv) a device capable of measuring transcript expression levels;
-   b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is substantially less than the total number of all transcripts;
-   c) identifying a centroid transcript within each of said plurality of transcript clusters, thereby creating a plurality of centroid transcripts, said remaining transcripts being non-centroid transcripts;
-   d) measuring the expression levels of at least a portion of transcripts from said second collection of biological samples with said device, wherein said portion of transcripts comprise transcripts identified as said centroid transcripts from said first library;
-   e) determining the ability of said measurements of the expression levels of said centroid transcripts to infer the levels of at least a portion of transcripts from said second library, wherein said portion is comprised of non-centroid transcripts;
-   f) selecting said centroid transcripts whose said expression levels have said ability to infer the levels of said portion of non-centroid transcripts.
-   2. The method of Paragraph 1, wherein said plurality of centroid transcripts is approximately 1000 centroid transcripts.
-   3. The method of Paragraph 1, wherein said device is selected from the group consisting of a microarray, a bead array, a liquid array, and a nucleic-acid sequencer.
-   4. The method of Paragraph 1, wherein said computational analysis comprises cluster analysis.
-   5. The method of Paragraph 1, wherein said method further comprises repeating steps c) to f) until validated centroid transcripts for each of said plurality of transcript clusters are identified.
-   6. The method of Paragraph 1, wherein said plurality of clusters of transcripts are orthogonal.
-   7. The method of Paragraph 1, wherein said plurality of clusters of transcripts are non-overlapping.
-   8. The method of Paragraph 1, wherein said determining involves a correlation between said expression levels of said centroid transcripts and said expression levels of said non-centroid transcripts.
-   9. The method of Paragraph 1, wherein expression levels of a set of substantially invariant transcripts are additionally measured with said device in said second collection of biological samples.
-   10. The method of Paragraph 9, wherein said measurements of said centroid transcripts made with said device, and said mRNA-expression data from said first and second libraries, are normalized with respect to the expression levels of a set of substantially invariant transcripts.
-   11. A method for identifying a subpopulation of predictive transcripts within a transcriptome, comprising:
-   a) providing:
-   i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples;
-   ii) a second collection of biological samples;
-   iii) a second library of transcriptome-wide mRNA-expression data from said second collection of biological samples;
-   iv) a device capable of measuring transcript expression levels;
-   b) performing computational analysis on said first library such that a plurality of transcript clusters are created, wherein the number of said clusters is less than the total number of all transcripts in said first library;
-   c) identifying a centroid transcript within each of said transcript clusters thereby creating a plurality of centroid transcripts, said remaining transcripts being non-centroid transcripts;
-   d) processing transcripts from said second collection of biological samples on said device so as to measure expression levels of said centroid transcripts, and
-   e) determining which of said plurality of centroid transcripts measured on said device predict the levels of said non-centroid transcripts in said second library of transcriptome-wide data.
-   12. The method of Paragraph 11, wherein said plurality of centroid transcripts is approximately 1000 centroid transcripts.
-   13. The method of Paragraph 11, wherein said device is selected from the group consisting of a microarray, a bead array, a liquid array, and a nucleic-acid sequencer.
-   14. The method of Paragraph 11, wherein said computational analysis comprises cluster analysis.
-   15. The method of Paragraph 11, wherein said determining involves a correlation between said centroid transcript and said non-centroid transcript.
-   16. The method of Paragraph 11, wherein said method further comprises repeating steps c) to e).
-   17. A method for identifying a subpopulation of approximately 1000 predictive transcripts within a transcriptome, comprising:
-   a) providing:
-   i) a first library of transcriptome-wide mRNA-expression data from a first collection of biological samples representing greater than 1000 different transcripts, and
-   ii) transcripts from a second collection of biological samples;
-   b) performing computational analysis on said first library such that a plurality of clusters of transcripts are created, wherein the number of said clusters is approximately 1000 and less than the total number of all transcripts in said first library;
-   c) identifying a centroid transcript within each of said transcript clusters, said remaining transcripts being non-centroid transcripts;
-   d) processing the transcripts from said second collection of biological samples so as to measure the expression levels of non-centroid transcripts, so as to create first measurements, and expression levels of centroid transcripts, so as to create second measurements; and
-   e) determining which centroid transcripts based on said second measurements predict the levels of said non-centroid transcripts, based on said first measurements, thereby identifying a subpopulation of predictive transcripts within a transcriptome.
-   18. The method of Paragraph 17, wherein said method further comprises a device capable of measuring the expression levels of said centroid transcripts.
-   19. The method of Paragraph 18, wherein said device is capable of measuring the expression levels of approximately 1000 of said centroid transcripts.
-   20. The method of Paragraph 17, wherein said computational analysis comprises cluster analysis.
-   21. The method of Paragraph 17, wherein said determining involves a correlation between said centroid transcript and said non-centroid transcript.
-   22. The method of Paragraph 17, wherein said method further comprises repeating steps c) to e).
-   23. A method for predicting the expression level of a first population of transcripts by measuring the expression level of a second population of transcripts, comprising:
-   a) providing:
-   i) a first heterogeneous population of transcripts comprising a second heterogeneous population of transcripts, said second population comprising a subset of said first population,
-   ii) an algorithm capable of predicting the level of expression of transcripts within said first population which are not within said second population, said predicting based on the measured level of expression of transcripts within said second population;
-   b) processing said first heterogeneous population of transcripts under conditions such that a plurality of different templates representing only said second population of transcripts is created;
-   c) measuring the amount of each of said different templates to create a plurality of measurements; and
-   d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of transcripts within said first population which are not within said second population.
-   24. The method of Paragraph 23, wherein said first heterogeneous population of transcripts comprise a plurality of non-centroid transcripts.
-   25. The method of Paragraph 23, wherein said second heterogeneous population of transcripts comprises a plurality of centroid transcripts.
-   26. The method of Paragraph 23, wherein said method further comprises a device capable of measuring the amount of approximately 1000 of said different templates.
-   27. The method of Paragraph 26, wherein said device is selected from the group consisting of a microarray, a bead array, a liquid array, and a nucleic-acid sequencer.
-   28. The method of Paragraph 23, wherein said algorithm involves a dependency matrix.
-   29. A method of assaying gene expression, comprising:
-   a) providing:
-   i) approximately 1000 different barcode sequences;
-   ii) approximately 1000 beads, each bead comprising a homogeneous set of nucleic-acid probes, each set complementary to a different barcode sequence of said approximately 1000 barcode sequences;
-   iii) a population of more than 1000 different transcripts, each transcript comprising a gene-specific sequence;
-   iv) an algorithm capable of predicting the level of expression of unmeasured transcripts;
-   b) processing said population of transcripts to create approximately 1000 different templates, each template comprising one of said approximately 1000 barcode sequences operably associated with a different gene-specific sequence, wherein said approximately 1000 different templates represents less than the total number of transcripts within said population;
-   c) measuring the amount of each of said approximately 1000 different templates to create a plurality of measurements; and
-   d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of unmeasured transcripts within said population.
-   30. The method of Paragraph 29, wherein said method further comprises a device capable of measuring the amount of each of said approximately 1000 different templates.
-   31. The method of Paragraph 29, wherein said beads are optically addressed.
-   32. The method of Paragraph 29, wherein said processing comprises ligation-mediated amplification.
-   33. The method of Paragraph 31, wherein said measuring comprises detecting said optically addressed beads.
-   34. The method of Paragraph 31, wherein said measuring comprises hybridizing said approximately 1000 different templates to said approximately 1000 beads through said nucleic-acid probes complementary to said approximately 1000 barcode sequences.
-   35. The method of Paragraph 31, wherein said measuring comprises a flow cytometer.
-   36. The method of Paragraph 29, wherein said algorithm involves a dependency matrix.
-   37. A composition comprising an amplified nucleic acid sequence, wherein said sequence comprises at least a portion of a cluster centroid transcript sequence and a barcode sequence, wherein said composition further comprises an optically addressed bead, and wherein said bead comprises a capture probe nucleic-acid sequence hybridized to said barcode.
-   38. The composition of Paragraph 37, wherein said barcode sequence is at least partially complementary to said capture probe nucleic acid.
-   39. The composition of Paragraph 37, wherein said amplified nucleic-acid sequence is biotinylated.
-   40. The composition of Paragraph 37, wherein said optically addressable bead is detectable with a flow cytometric system.
-   41. The composition of Paragraph 40, wherein said flow cytometric system discriminates between approximately 500-1000 optically addressed beads.
-   42. A method for creating a genome-wide expression profile, comprising:
-   a) providing:
-   i) a plurality of genomic transcripts derived from a biological sample;
-   ii) a plurality of centroid transcripts comprising at least a portion of said genomic transcripts, said remaining genomic transcripts being non-centroid transcripts;
-   b) measuring the expression level of said plurality of centroid transcripts;
-   c) inferring the expression levels of said non-centroid transcripts from said centroid transcript expression levels, thereby creating a genome-wide expression profile.
-   43. The method of Paragraph 42, wherein said plurality of centroid transcripts comprise approximately 1,000 transcripts.
-   44. The method of Paragraph 42, wherein said measuring comprises a device selected from the group consisting of a microarray, a bead array, a liquid array, and a nucleic-acid sequencer.
-   45. The method of Paragraph 42, wherein said inferring involves a dependency matrix.
-   46. The method of Paragraph 42, wherein said genome-wide expression profile identifies said biological sample as diseased.
-   47. The method of Paragraph 42, wherein said genome-wide expression profile identifies said biological sample as healthy.
-   48. The method of Paragraph 42, wherein said genome-wide expression profile provides a functional readout of the action of a perturbagen.
-   49. The method of Paragraph 42, wherein said genome-wide expression profile comprises an expression profile suitable for use in a connectivity map.
-   50. The method of Paragraph 49, wherein said expression profile is compared with query signatures for similarities.
-   51. The method of Paragraph 42, wherein said genome-wide expression profile comprises a query signature compatible with a connectivity map.
-   52. The method of Paragraph 51, wherein said query signature is compared with known genome-wide expression profiles for similarities.
-   53. A kit, comprising:
-   a) a first container comprising a plurality of centroid transcripts derived from a transcriptome;
-   b) a second container comprising buffers and reagents compatible with measuring the expression level of said plurality of centroid transcripts within a biological sample;
-   c) a set of instructions for inferring the expression level of non-centroid transcripts within said biological sample, based upon the expression level of said plurality of centroid transcripts.
-   54. The kit of Paragraph 53, wherein said plurality of centroid transcripts is approximately 1,000 transcripts.
-   55. A method for making a transcriptome-wide mRNA-expression profile, comprising:
-   a) providing:
-   i) a composition of validated centroid transcripts numbering substantially less than the total number of all transcripts;
-   ii) a device capable of measuring the expression levels of said validated centroid transcripts;
-   iii) an algorithm capable of substantially calculating the expression levels of transcripts not amongst the set of said validated centroid transcripts from expression levels of said validated centroid transcripts measured by said device and transcript cluster information created from a library of transcriptome-wide mRNA-expression data from a collection of biological samples; and
-   iv) a biological sample;
-   b) applying said biological sample to said device whereby expression levels of said validated centroid transcripts in said biological sample are measured;
-   c) applying said algorithm to said measurements thereby creating a transcriptome-wide mRNA expression profile.
-   56. The method of Paragraph 55, wherein said validated centroid transcripts comprise approximately 1,000 transcripts.
-   57. The method of Paragraph 55, wherein said device is selected from the group consisting of a microarray, a bead array, a liquid array, and a nucleic-acid sequencer.
-   58. The method of Paragraph 55, wherein expression levels of a set of substantially invariant transcripts are additionally measured in said biological sample.
-   59. The method of Paragraph 55, wherein said expression levels of said validated centroid transcripts are normalized with respect to said expression levels of said invariant transcripts.
-   101. A method, comprising:
-   a) providing:
-   i) a sample comprising a plurality of analytes;
-   ii) a plurality of solid substrate populations, wherein each of said solid substrate populations comprise a plurality of subsets, and wherein each subset is present in an unequal proportion from every other subset in the same solid substrate population;
-   iii) a plurality of capture probes capable of attaching to said plurality of analytes, wherein each subset comprises a different capture probe; and
-   iv) a means for detecting said plurality of subsets that is capable of creating a multimodal intensity distribution pattern;
-   b) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created;
-   c) identifying said plurality of analytes from said multimodal distribution pattern.
-   102. The method of Paragraph 101, wherein said sample may be selected from the group comprising a biological sample, a soil sample, or a water sample.
-   103. The method of Paragraph 101, wherein said plurality of analytes may be selected from the group comprising nucleic acids, proteins, peptides, biological receptors, enzymes, antibodies, polyclonal antibodies, monoclonal antibodies, or Fab fragments.
-   104. The method of Paragraph 101, wherein said solid substrate population comprises a bead-set population.
-   105. The method of Paragraph 101, wherein said unequal proportions comprise two subsets in an approximate ratio of 1.25:0.75.
-   106. The method of Paragraph 101, wherein said unequal proportions comprise three subsets in an approximate ratio of 1.25:1.00:0.75.
-   107. The method of Paragraph 101, wherein said unequal proportions comprise four subsets in an approximate ratio of 1.25:1.00:0.75:0.50.
-   108. The method of Paragraph 101, wherein said unequal proportions comprise five subsets in an approximate ratio of 1.50:1.25:1.00:0.75:0.50.
-   109. The method of Paragraph 101, wherein said unequal proportions comprise six subsets in an approximate ratio of 1.75:1.50:1.25:1.00:0.75:0.50.
-   110. The method of Paragraph 101, wherein said unequal proportions comprise seven subsets in an approximate ratio of 2.00:1.75:1.50:1.25:1.00:0.75:0.50.
-   111. The method of Paragraph 101, wherein said unequal proportions comprise eight subsets in an approximate ratio of 2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25.
-   112. The method of Paragraph 101, wherein said unequal proportions comprise nine subsets in an approximate ratio of 2.25:2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25.
-   113. The method of Paragraph 101, wherein said unequal proportions comprise ten subsets in an approximate ratio of 2.50:2.25:2.00:1.75:1.50:1.25:1.00:0.75:0.50:0.25.
-   114. A method, comprising:
-   a) providing:
-   i) a solid substrate population comprising a first subset and a second subset, wherein the first subset is present in a first proportion and the second subset is present in a second proportion;
-   ii) a first analyte attached to said first subset;
-   iii) a second analyte attached to said second subset; and
-   iv) a means for detecting said first subset and second subset that is capable of creating a multimodal intensity distribution pattern;
-   b) detecting said first subset and said second subset with said means, wherein a multimodal intensity distribution pattern is created; and
-   c) identifying said first analyte and said second analyte from said multimodal distribution pattern.
-   115. The method of Paragraph 114, wherein said solid substrate population comprises a label.
-   116. The method of Paragraph 115, wherein said label comprises a mixture of at least two different fluorophores.
-   117. The method of Paragraph 114, wherein said first proportion is different from said second proportion.
-   118. The method of Paragraph 114, wherein said first analyte is attached to said first subset with a first capture probe.
-   119. The method of Paragraph 114, wherein said second analyte is attached to said second subset with a second capture probe.
-   120. The method of Paragraph 114, wherein said multimodal intensity distribution pattern comprises a first peak corresponding to said first subset.
-   121. The method of Paragraph 114, wherein said multimodal intensity distribution pattern comprises a second peak corresponding to said second subset.
-   122. A method, comprising:
-   a) providing:
-   i) a solid substrate population comprising a plurality of subsets;
-   ii) a sample comprising a plurality of analytes, wherein at least one portion of said plurality of analytes comprise related analytes; and
-   iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern;
-   b) attaching each of said related analyte portions to one of said plurality of subsets;
-   c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and
-   d) identifying said related analytes from said multimodal distribution pattern.
-   123. The method of Paragraph 122, wherein said related analytes comprise linked genes.
-   124. A method, comprising:
-   a) providing:
-   i) a solid substrate population comprising a plurality of subsets;
-   ii) a sample comprising a plurality of analytes, wherein at least one portion of the plurality of analytes comprise rare event analytes; and
-   iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern;
-   b) attaching a portion of said plurality of analytes comprising one or more of the rare event analytes to one of the plurality of subsets;
-   c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and
-   d) determining if said rare event analytes occur in said multimodal distribution pattern.
-   125. The method of Paragraph 124, wherein said rare event analyte portion is present in approximately less than 0.01% of said sample.
-   126. The method of Paragraph 124, wherein said rare event analyte comprises a small molecule or drug.
-   127. The method of Paragraph 124, wherein said rare event analyte comprises a nucleic acid mutation.
-   128. The method of Paragraph 124, wherein said rare event analyte comprises a diseased cell.
-   129. The method of Paragraph 124, wherein said rare event analyte comprises an autoimmune antibody.
-   130. The method of Paragraph 124, wherein said rare event analyte comprises a microbe.
-   131. A method, comprising:
-   a) providing:
-   i) a solid substrate population comprising a plurality of subsets;
-   ii) a sample comprising a first labeled analyte and a second labeled analyte; and
-   iii) a means for detecting said subsets that is capable of creating a multimodal intensity distribution pattern;
-   b) attaching said first and second labeled analytes in an unequal proportion to one of said plurality of subsets;
-   c) detecting said plurality of subsets with said means, wherein a multimodal intensity distribution pattern is created; and
-   d) identifying said first and second labeled analytes from said multimodal distribution pattern.
-   132. The method of Paragraph 131, wherein said first labeled analyte comprises a normal cell.
-   133. The method of Paragraph 131, wherein said second labeled analyte comprises a tumor cell.
-   134. The method of Paragraph 131, wherein said multimodal intensity distribution pattern comprises a first peak corresponding to said first labeled analyte.
-   135. The method of Paragraph 131, wherein said multimodal intensity distribution pattern comprises a second peak corresponding to said second labeled analyte.
-   136. The method of Paragraph 131, wherein said unequal proportion is equivalent to a ratio of said first and second peaks.
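By way of non-limiting illustration of Paragraphs 101, 105 and 114, the following Python sketch shows one way in which two analytes sharing a single solid substrate population might be identified from a bimodal intensity distribution when the two subsets are mixed in an approximate ratio of 1.25:0.75. The function names, the one-dimensional two-means peak split, the use of the median as the reported level, and the simulated intensities are assumptions made for this example only; they are not taken from the specification and other peak-detection approaches may be used.

```python
import numpy as np

def split_two_peaks(intensities, n_iter=50):
    """One-dimensional two-means split (an assumed, simple peak finder).

    Returns a boolean mask that is True for detection events assigned
    to the upper-intensity peak. Assumes two distinct peaks are present.
    """
    x = np.asarray(intensities, dtype=float)
    low, high = x.min(), x.max()
    for _ in range(n_iter):
        upper = np.abs(x - high) < np.abs(x - low)
        low, high = x[~upper].mean(), x[upper].mean()
    return upper

def assign_analytes(intensities, major_analyte, minor_analyte):
    """Assign the larger peak to the subset present in the larger proportion."""
    x = np.asarray(intensities, dtype=float)
    upper = split_two_peaks(x)
    peaks = [x[upper], x[~upper]]
    peaks.sort(key=len, reverse=True)  # peak with the most events first
    return {
        major_analyte: float(np.median(peaks[0])),
        minor_analyte: float(np.median(peaks[1])),
    }

# Simulated (hypothetical) log-intensities for one bead population:
# 125 beads carry the probe for gene_A and 75 carry the probe for gene_B.
rng = np.random.default_rng(1)
gene_a = rng.normal(9.0, 0.2, size=125)
gene_b = rng.normal(6.0, 0.2, size=75)
mixed = rng.permutation(np.concatenate([gene_a, gene_b]))

levels = assign_analytes(mixed, "gene_A", "gene_B")
print(levels)
```

In this sketch the identity of each peak is recovered from the number of detection events it contains rather than from its position, which is why the subsets are provided in unequal proportions. The sketch does not address the case in which both analytes produce similar intensities and the two peaks merge.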

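By way of non-limiting illustration of Paragraphs 1, 8, 28 and 45, the following Python sketch shows one way in which a dependency matrix might be fitted on a first library, applied to centroid measurements from a second collection of samples, and assessed by correlation against a second library. The ordinary least-squares fit, the function names, the correlation threshold of 0.9, and the simulated expression data are assumptions made for this example only; the specification does not limit the dependency matrix to this form, and the centroid transcripts are taken here as already identified.

```python
import numpy as np

def fit_dependency_matrix(library, centroid_idx):
    """Fit least-squares weights mapping centroid levels to non-centroid levels.

    library: samples x transcripts array of expression levels.
    Returns the non-centroid column indices and a weight matrix whose
    first row holds per-transcript intercepts.
    """
    other_idx = np.setdiff1d(np.arange(library.shape[1]), centroid_idx)
    design = np.column_stack([np.ones(library.shape[0]), library[:, centroid_idx]])
    weights, *_ = np.linalg.lstsq(design, library[:, other_idx], rcond=None)
    return other_idx, weights

def infer_non_centroids(centroid_levels, weights):
    """Infer non-centroid levels from measured centroid levels."""
    design = np.column_stack([np.ones(centroid_levels.shape[0]), centroid_levels])
    return design @ weights

def per_transcript_correlation(predicted, observed):
    """Pearson correlation between inferred and observed levels, per transcript."""
    p = predicted - predicted.mean(axis=0)
    o = observed - observed.mean(axis=0)
    return (p * o).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (o ** 2).sum(axis=0))

# Simulated (hypothetical) data: 40 transcripts driven by 5 latent programs,
# with transcripts 0-4 standing in for previously identified centroids.
rng = np.random.default_rng(0)
n_factors, n_transcripts = 5, 40
loadings = rng.normal(size=(n_factors, n_transcripts))
loadings[:, :n_factors] = np.eye(n_factors)
centroid_idx = np.arange(n_factors)

def simulate_library(n_samples):
    factors = rng.normal(size=(n_samples, n_factors))
    noise = 0.05 * rng.normal(size=(n_samples, n_transcripts))
    return factors @ loadings + noise

first_library = simulate_library(200)   # used to fit the dependency matrix
second_library = simulate_library(100)  # used to assess inference

other_idx, weights = fit_dependency_matrix(first_library, centroid_idx)
predicted = infer_non_centroids(second_library[:, centroid_idx], weights)
correlations = per_transcript_correlation(predicted, second_library[:, other_idx])
well_inferred = other_idx[correlations > 0.9]  # assumed acceptance threshold
print(len(well_inferred), "of", len(other_idx), "non-centroid transcripts well inferred")
```

In this sketch only the five centroid columns of the second library are used for inference; the remaining columns serve solely as the reference against which the inferred levels are compared.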
Having thus described in detail preferred embodiments of the present invention, it is to be understood that the invention defined by the above paragraphs is not to be limited to particular details set forth in the above description as many apparent variations thereof are possible without departing from the spirit or scope of the present invention.

What is claimed is: 1.-15. (canceled)
16. A method of assaying gene expression, comprising: a) providing: i) a plurality of different barcode sequences; ii) a plurality of a solid support, each solid support comprising a homogeneous set of nucleic acid probes, each set complementary to a different barcode sequence of said plurality of barcode sequences; iii) a population of more than 1000 different transcripts, each transcript comprising a gene-specific sequence; iv) an algorithm capable of predicting the level of expression of unmeasured transcripts; b) processing said population of transcripts to create a plurality of different templates, each template comprising one of said plurality of barcode sequences operably associated with a different gene-specific sequence, wherein said plurality of different templates represents less than the total number of transcripts within said population; c) measuring the amount of each of said plurality of different templates to create a plurality of measurements; and d) applying said algorithm to said plurality of measurements, thereby predicting the level of expression of unmeasured transcripts within said population.
17. The method according to claim 16, wherein said method further comprises a device capable of measuring the amount of each of said plurality of different templates.
18. The method according to claim 16, wherein said solid support are optically addressed.
19. The method according to claim 16, wherein said processing comprises ligation-mediated amplification.
20. The method according to claim 18, wherein said measuring comprises detecting said optically addressed solid support.
21. The method according to claim 20, wherein said measuring comprises hybridizing said plurality of different templates to said plurality of a solid support through said nucleic-acid probes complementary to said plurality of barcode sequences.
22. The method according to claim 16, wherein said measuring comprises a flow cytometer.
23. A method according to claim 1, wherein expression levels of a set of substantially invariant transcripts are additionally measured.
24. A method according to claim 23, wherein said expression levels of said unmeasured transcripts are normalized with respect to said expression levels of said invariant transcripts.
25. A kit for use in the method of claim 16, the kit comprising: a) the population of different transcripts; b) the plurality of different bar code sequences; c) the plurality of solid support; and d) a set of instructions for assaying the expression level of said transcripts.
26. The method of claim 16, wherein the solid support is a bead, membrane, filter, or slide.
27. The method of claim 16, wherein the solid support is made of a material selected from the group consisting of polystyrene, polycarbonate, polypropylene, nylon, glass, dextran, chitin, sand, pumice, agarose, polysaccharides, dendrimers, buckyballs, polyacrylamide, silicon, and rubber.
28. The method of claim 16, wherein the plurality of different barcode sequences is approximately 500 to approximately 1000 different barcode sequences, the plurality of solid support is approximately 500 to approximately 1000 solid supports, the population of different transcripts is approximately 500 to approximately 1000 different transcripts, and/or the plurality of different templates is approximately 500 to approximately 1000 different templates.
29. The method of claim 16, wherein the plurality of different barcode sequences is approximately 1000 different barcode sequences, the plurality of solid support is approximately 1000 solid supports, the population of different transcripts is approximately 1000 different transcripts, and/or the plurality of different templates is approximately 1000 different templates.